Abstract

Universal compression of patterns of sequences generated by independently identically
distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a
sequence of indices that contains all consecutive indices in increasing order of first occurrence.
If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding
the unknown alphabet symbols can be exploited to create the pattern of the sequence. This
pattern can in turn be compressed by itself. It is shown that if the alphabet size $k$ is essentially
small, then the average minimax and maximin redundancies as well as the redundancy of every
code for almost every source, when compressing a pattern, consist of at least $0.5 \log(n/k^3)$
bits per each unknown probability parameter, and if all alphabet letters are likely to occur,
there exist codes whose redundancy is at most $0.5 \log(n/k^2)$ bits per each unknown probability
parameter, where $n$ is the length of the data sequences. Otherwise, if the alphabet is large,
these redundancies are essentially at least $O(n^{-2/3})$ bits per symbol, and there exist codes that
achieve redundancy of essentially $O(n^{-1/2})$ bits per symbol. Two sub-optimal low-complexity
sequential algorithms for compression of patterns are presented and their description lengths
analyzed, also pointing out that the pattern average universal description length can decrease
below the underlying i.i.d. entropy for large enough alphabets.

*This work was partially supported by the University of Utah, ECE Department, startup fund and NSF Grant
CCF-0347969. Parts of the material in this paper were presented at the 40th Annual Allerton Conference on
Communication, Control, and Computing, Monticello, IL, October 2-4, 2002, the IEEE International Symposium on
Information Theory, Chicago, IL, June 27 - July 2, 2004, and the Data Compression Conference, Snowbird, Utah,
U.S.A., March 23-25, 2004.
1 Introduction

Classical universal compression [5] usually considers coding sequences that were generated by a 
source with a known alphabet but with some unknown statistics. In this paper, we consider the 
universal coding problem, where an independently identically distributed (i.i.d.) source generates 
data from an alphabet that is totally unknown to both encoder and decoder, and whose size k can 
grow with n. In this case, the cost of coding the alphabet letters is inevitable and depends strictly 
on the alphabet letters themselves. However, after coding of the alphabet letters, the data sequence 
can be uniquely represented by its pattern. The pattern of a sequence is a sequence of pointers 
that point to the actual alphabet letters, where the alphabet letters are assigned indices in order 
of first occurrence. For example, the pattern of the sequence "lossless" is "12331433". A pattern 
sequence thus contains all positive integers from 1 up to a maximum value k in increasing order of 
first occurrence, and is also independent of the alphabet of the actual data. One can separate the 
coding of the alphabet symbols from that of the pattern, and use universal coding techniques to 
encode patterns. The universal coding cost of a totally unknown alphabet is inevitable regardless of 
the code used, and depends strictly on the actual alphabet letters. Therefore, the more interesting 
universal coding problem becomes that of efficiently encoding the alphabet independent patterns. 
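
The separation described above can be illustrated with a minimal sketch. The function names `pattern` and `reconstruct` are hypothetical and are not taken from the paper; the sketch only shows that the alphabet, listed in order of first occurrence, together with the pattern, determines the sequence.

```python
def pattern(seq):
    """Split a sequence into (alphabet in order of first occurrence, pattern).

    The pattern is a list of 1-based indices, as in the "lossless" example.
    """
    index = {}      # symbol -> index assigned at its first occurrence
    alphabet = []   # symbols in order of first occurrence
    pat = []
    for symbol in seq:
        if symbol not in index:
            alphabet.append(symbol)
            index[symbol] = len(alphabet)
        pat.append(index[symbol])
    return alphabet, pat


def reconstruct(alphabet, pat):
    """Recover the original symbols from the alphabet description and the pattern."""
    return [alphabet[i - 1] for i in pat]


alphabet, pat = pattern("lossless")
print("".join(map(str, pat)))               # 12331433
print("".join(reconstruct(alphabet, pat)))  # lossless
```

The pattern itself does not depend on which letters were used, which is why it can be coded separately from the alphabet description.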

To the best of our knowledge, the idea of separating the description of the alphabet symbols from 
the representation of the pattern of a sequence for universal coding first appeared in the literature 
in [1]. This procedure was motivated in [1] by the multi-alphabet coding problem [41], i.e., the
problem in which a sequence is generated by a known alphabet, but contains only a small subset
of the alphabet letters. A separate description was used to inform the decoder which symbols from
the alphabet have occurred in a sequence, and then their pattern was coded separately. However,
no theoretical evidence was provided to show that such a technique has an advantage over other
multi-alphabet coding techniques, such as those proposed in [41].

Stronger motivation for coding patterns of sequences was first given by Jevtic, Orlitsky, and 
Santhanam [13] (see also [17]-[21]), who motivated this problem by the problem of universal coding 
of sequences generated by sources over alphabets that are initially unknown to both the encoder and 
the decoder. The encoder then has to send the decoder complete information about the alphabet 
letters, and can utilize this inevitable cost to improve the coding performance by representing the
actual data sequence by its pattern. This problem can be strongly motivated by many practical 
applications that compress sequences generated by either a small or a large alphabet. For example, 
consider transmission of text in a language that was never seen before. The graphical structure 
of the letters must first be transmitted. If it is transmitted in the order of first occurrence, the 
pattern of the text can then be compressed. This application further motivates the problem of 
pattern compression over large alphabets because in text the natural alphabet unit can be a word 
instead of a letter. Another example is compression of sequences of species first seen on another 
planet. Since there is no prior knowledge of their forms, they can be designated by their pattern, 
i.e., the first species encountered is number 1, the second number 2, and so on.

The i.i.d. case is the simplest one to consider. However, coding of patterns whose underlying 
process is i.i.d. is different from coding of i.i.d. sequences because the constraints that are imposed 
by the definition of a pattern result in a non-i.i.d. probability mass function over the patterns that is 
different from the i.i.d. one of the original sequence. This allows shorter representations for patterns 
than those that would be used for the underlying i.i.d. sequences. Of course, this improvement is not 
free, and it only comes because of the inevitable price of representing the alphabet itself. However, 
while it was shown by Kieffer in [14] that if the alphabet size is very large (goes to infinity), no
universal code exists, i.e., no code can achieve vanishing redundancy for i.i.d. sequences, this is not
the case for the resulting patterns, as was first shown by Orlitsky et al. in [13], [17]-[20], because
at most $n$ letters of the actual alphabet appear in the pattern sequence. Furthermore, better
universal compression performance is also possible in the case where the alphabet size $k$ is
sub-linear in $n$ or even fixed. Moreover, even better non-universal compression is sometimes possible
because every pattern represents a collection of many sequences, thus reducing the overall pattern
entropy (see, e.g., [31], [34], [36], [38], [39]).

The classical setting of the universal lossless compression problem [5] assumes that a sequence
$x^n$ of length $n$ that was generated by a source $\boldsymbol{\theta}$ is to be compressed without knowledge of the
particular $\boldsymbol{\theta}$ that generated $x^n$ but with knowledge of the class $\Lambda$ of all possible sources $\boldsymbol{\theta}$. The
average performance of any given code that assigns a length function $L(\cdot)$ is judged on the basis
of the redundancy function $R_n(L, \boldsymbol{\theta})$, which is defined as the difference between the expected code
length of $L(\cdot)$ with respect to (w.r.t.) the given source probability mass function $P_{\boldsymbol{\theta}}$ and the
$n$th-order entropy of $P_{\boldsymbol{\theta}}$, normalized by the length $n$ of the uncoded sequence.

Naturally, the lack of knowledge of the source parameters in universal coding results in some 
redundancy when coding data emitted by any or almost any unknown source from a known class.
To measure the universality of such a class, some notion of this redundancy is used to represent
the best possible performance for some worst case, i.e., the redundancy expected from the best
code for the worst case. This notion of redundancy thus serves as a lower bound on the worst case
redundancy of any code for this class of sources. Two such notions are the maximin redundancy
and the minimax redundancy, defined in Davisson [5]. In the maximin Bayesian approach, the
parameter $\boldsymbol{\theta}$ is considered random, and the maximin redundancy is obtained by the worst
distribution that maximizes the minimum expected redundancy, i.e., the worst distribution for the best
code. The minimax approach considers the parameter to be deterministic, and defines the minimax
redundancy as the redundancy of the best code for the worst choice of $\boldsymbol{\theta}$. A third stronger notion
of redundancy for "most" sources in a class was later established by Rissanen [24]. This notion
describes the performance of the best possible code for almost every source in the class except a
subset of the class whose probability under the uniform prior (i.e., distribution in $\Lambda$) is negligible,
and for which smaller redundancy can be obtained. A different approach to the study of universal
codes is that of individual sequences. The minimax redundancy for individual sequences [40] is
the redundancy of the best code for the worst possible sequence $x^n$ that can be generated by any
source in the class. In this paper, however, we focus on average redundancies.

Several publications [5], [6], [7], [10], [24] investigated the average redundancy performance in 
standard compression of classes of parametric sources and in particular i.i.d. sources over alphabets 
of size $k$, which are governed by $k - 1$ parameters. It was shown that for a finite size alphabet, each
unknown probability parameter costs at least $0.5 \log n$ extra redundancy bits. This lower bound
applies in all average senses: minimax and maximin (which were demonstrated to be identical), and
for almost all sources in the class. It also applies in the minimax individual sense. Furthermore,
it was shown to be achievable, in particular by a linear complexity (fixed per symbol)
sequential coding scheme that combines the universal mixture based Krichevsky-Trofimov (KT)
probability estimators [15] with arithmetic coding [25]. Recently, in [29], [30], [33], we extended the
average results and showed that if the alphabet size is allowed to grow sub-linearly with $n$, each
probability parameter costs $0.5 \log(n/k)$ bits in all average senses, and that this redundancy is
achievable even sequentially with the KT estimators. At the same time, related results have been
independently obtained for the individual case by Orlitsky, Santhanam, and Zhang [19], [21]. 
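
As a small numerical illustration of the KT scheme mentioned above, the following sketch computes the ideal code length (ignoring integer constraints and the arithmetic coder itself) of the standard add-half sequential assignment, $(n_a + 1/2)/(t + k/2)$ for symbol $a$ after $t$ symbols, and compares it with the ML code length. The function names are hypothetical, and the binary test sequence is an arbitrary choice.

```python
import math
from collections import Counter


def kt_code_length(seq, k):
    """Ideal code length in bits of seq under the sequential KT assignment over k symbols."""
    counts = Counter()
    bits = 0.0
    for t, symbol in enumerate(seq):
        # Add-half rule: (count of symbol so far + 1/2) / (symbols seen so far + k/2).
        prob = (counts[symbol] + 0.5) / (t + k / 2)
        bits -= math.log2(prob)
        counts[symbol] += 1
    return bits


def ml_code_length(seq):
    """-log2 of the i.i.d. ML probability of seq, i.e., n times the empirical entropy."""
    n = len(seq)
    return -sum(c * math.log2(c / n) for c in Counter(seq).values())


seq = "0110100110010110" * 8   # arbitrary binary example, n = 128
n, k = len(seq), 2
gap = kt_code_length(seq, k) - ml_code_length(seq)
print(gap, 0.5 * (k - 1) * math.log2(n))  # extra bits vs. 0.5 log n per parameter
```

For a binary alphabet the gap stays within $0.5 \log n + 1$ bits of the ML code length, in line with the cost of roughly $0.5 \log n$ bits per unknown parameter.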

While standard universal compression, in particular that of i.i.d. sources, has been extensively 
researched, the problem of compression of patterns has only been addressed recently, with focus, 
until now, only on the individual sequence case. The initial work on this problem was presented
in [1]. However, Jevtic, Orlitsky, Santhanam, and Zhang [13], [17]-[21] have recently achieved
significant progress in understanding this problem. In particular, they considered the performance
of the best universal code for the worst sequence over all possible patterns generated by underlying
i.i.d. sequences of length $n$. Using combinatoric techniques, they have shown that the minimax
individual pattern redundancy is essentially at least $O(n^{-2/3})$ bits per symbol and at most
$O(n^{-1/2})$ bits per symbol. They also presented a computationally heavy sequential algorithm that
achieves the order of the upper bound and a sub-optimal low-complexity
sequential algorithm that achieves redundancy of $O(n^{-1/3})$ bits per symbol.

In this paper, we focus, unlike previous work, on the average redundancy performance of
universal codes for coding patterns. We also consider the different behavior for different alphabet
sizes $k$, and investigate the actual description length required for patterns. First, lower bounds
on the average minimax/maximin redundancies are obtained as a function of the alphabet size $k$.
(These bounds naturally apply also to the worst case individual redundancies.) Then, we derive
lower bounds on the redundancy for most sources. Next, we obtain upper bounds on the
redundancy focusing on the case in which all actual alphabet symbols are likely to be observed in the
coded sequence. Although we use techniques that are much different from those used in [1], [13],
[17]-[21] for the derivation of the minimax lower bound and the upper bound, the average case
results we obtain in this paper demonstrate similar behavior of the redundancy in the average
cases to that of the individual worst case. This is very important, because it demonstrates that
the expected behavior for the worst setting is not much better than the worst sequence behavior.
Hence, when coding patterns, as when coding standard sequences, one cannot expect to perform
significantly better for the worst source than the performance for the worst sequence. Next, two
sub-optimal low-complexity sequential algorithms are presented. The actual description length of
these algorithms is studied (where the displacement relative to the i.i.d. source entropy, defined as
the modified redundancy for patterns, is considered). The description length for these algorithms
demonstrates an interesting result, where the pattern entropy for large enough alphabets must
decrease compared to the i.i.d. one. Subsequently to the work presented here (see also [36]), pattern
entropy and entropy rate have been extensively studied, first in [34], and later in [11]-[12], [22]-[23],
[31], [38]-[39].

To derive the lower bounds, we use the relations between redundancy and capacity that are
presented in Section 3 based on [5], [9], [16]. The minimax/maximin bound we obtain for larger $k$'s
is larger than that obtained for most sources. This is because we must use different techniques to
derive the two bounds, where the more demanding conditions to obtain the bound for most sources
result in a smaller bound. This hints that, in the case of patterns,
it may cost more redundancy beyond the entropy to code the worst source than it costs to code
most other sources in the class. The upper bounds are obtained by a constructive approach. For
small $k$'s it combines Rissanen's approach [24] with our recent approach from [30], [33] and with
the more demanding conditions in coding patterns.

For readability and convenience, each of the sections that contain heavy analysis is structured 
such that the results and their properties are described first. Then, a short description of the 
structure of the proof is given. Finally, each such section is concluded with the technical proofs, 
where steps that require much technical detail are relegated to appendices. 

The outline of the paper is as follows. In Section 2, we define the notation. Section 3 reviews the 
individual sequence results of coding patterns, and the techniques we use to derive the new results. 
Section 4 summarizes the main results in the paper. Sections 5 and 6 contain the derivations of the 
minimax/maximin lower bounds and the bounds for most sources, respectively. In Section 7, we 
derive upper bounds on the redundancy with focus on the class of sources for which all symbols are 
likely to be observed. In Section 8, we present the sequential algorithms and study their description 
lengths and their displacements from the i.i.d. entropy. Then, in Section 9, the
results are discussed. Finally, some concluding remarks are given in Section 10.

2 Notation and Definitions 

2.1 Universal Coding 

Let $x^n = (x_1, x_2, \ldots, x_n)$ denote a sequence of $n$ symbols over an unknown alphabet $\Sigma$ of size
$k$. The class of all i.i.d. sources that can generate any sequence $x^n$ over $\Sigma$ will be denoted by $\Lambda$.
The subclass of i.i.d. sources that generate up to $k$ alphabet symbols will be denoted by $\Lambda_k$. The
subclass of sources that generate $k$ symbols that are likely to be observed with probability greater
than $1 - o(k/n)$ will be denoted by $\tilde{\Lambda}_k$. A parameter $\boldsymbol{\theta} \in \Lambda_k$ is a vector of $k - 1$ probability parameters
$\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_{k-1})$. For convenience, we will sometimes use the constrained component $\theta_k$ of $\boldsymbol{\theta}$.
All $k$ components of $\boldsymbol{\theta}$ are non-negative and sum up to 1. In general, boldface letters will denote
vectors, whose components will be denoted by their indices in the vector. We will use hat to denote
the Maximum Likelihood (ML) estimator of a parameter obtained from the data sequence $x^n$, e.g.,
$\hat{\boldsymbol{\theta}}$ will denote the ML estimator of $\boldsymbol{\theta}$. Capital letters will denote random variables.

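Let $\boldsymbol{\theta} \in \Lambda_k$ be a parameter vector that determines the statistical parameters of some source in
the class $\Lambda_k$. Let $x^n$ be a sequence of $n$ symbols generated by the source $\boldsymbol{\theta}$. The average $n$th-order
redundancy obtained by a code that assigns length function $L(\cdot)$ for source $\boldsymbol{\theta}$ is defined as

$$R_n(L, \boldsymbol{\theta}) = \frac{1}{n} E_{\boldsymbol{\theta}} L[X^n] - H_{\boldsymbol{\theta}}[X], \tag{1}$$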

where $E_{\boldsymbol{\theta}}$ denotes expectation w.r.t. the parameter $\boldsymbol{\theta}$, and $H_{\boldsymbol{\theta}}[X]$ is the (per-symbol) entropy of
the source. (We will also use $H_{\boldsymbol{\theta}}[X^n]$ as the $n$th-order sequence entropy of $\boldsymbol{\theta}$, where in the i.i.d.
case, $H_{\boldsymbol{\theta}}[X^n] = n H_{\boldsymbol{\theta}}[X]$.) It has been established in the literature (see, e.g., [15], [16], [24])
that assigning a universal probability $Q(x^n)$ is identical to designing a universal code for coding
$x^n$, because entropy coding techniques can be used to code the sequence using a number of bits
that equals, up to integer length constraints, the negative base-2 logarithm of the
assigned probability. In particular, one can use arithmetic coding [25] to allow sequential coding
with sequential probability assignment schemes. We will thus ignore integer length constraints, and
in places consider the redundancy as a function of the probability assignment scheme $Q(\cdot)$ instead
of the code $L(\cdot)$.

We can also define the individual sequence redundancy (see, e.g., [40]) of a code with length
function $L(\cdot)$ per sequence $x^n$ as

$$\hat{R}_n(L, x^n) = \frac{1}{n}\left\{L(x^n) + \log P_{ML}(x^n)\right\}, \tag{2}$$

where the logarithm function is taken to the base of 2, here and elsewhere, and $P_{ML}(x^n) = P_{\hat{\boldsymbol{\theta}}}(x^n)$
is the probability of $x^n$ given by the ML estimator $\hat{\boldsymbol{\theta}}$ of the governing parameters. The negative
logarithm of this probability is the smallest possible code length for a particular sequence under a
given statistical model (in our case the i.i.d. one).

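The average minimax redundancy of the class $\Lambda_k$ is defined as

$$R_n^+(\Lambda_k) = \min_L \sup_{\boldsymbol{\theta} \in \Lambda_k} R_n(L, \boldsymbol{\theta}). \tag{3}$$

Similarly, we can define the individual minimax redundancy as that of the best code $L(\cdot)$ for the
worst sequence $x^n$, i.e.,

$$\hat{R}_n^+(\Lambda_k) = \min_L \sup_{\boldsymbol{\theta} \in \Lambda_k} \max_{x^n} \frac{1}{n}\left\{L(x^n) + \log P_{\boldsymbol{\theta}}(x^n)\right\}. \tag{4}$$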
To define the maximin redundancy of $\Lambda_k$, let us assign a probability measure (prior) $w(\cdot)$ on
$\Lambda_k$ and let us define the mixture source

$$P_w(x^n) = \int_{\Lambda_k} w(d\boldsymbol{\theta}) P_{\boldsymbol{\theta}}(x^n). \tag{5}$$

The average redundancy associated with a length function $L(\cdot)$ is defined as

$$R_n(L, w) = \int_{\Lambda_k} w(d\boldsymbol{\theta}) R_n(L, \boldsymbol{\theta}). \tag{6}$$

The minimum expected redundancy for a given prior $w$ (which is attained by the ideal code length
w.r.t. the mixture, $L(x^n) = -\log P_w(x^n)$) is defined as

$$R_n(w) = \min_L R_n(L, w). \tag{7}$$

Finally, the maximin redundancy of the class $\Lambda_k$ is the worst case minimum expected redundancy
among all priors $w$, i.e.,

$$R_n^-(\Lambda_k) = \sup_w R_n(w). \tag{8}$$

2.2 Patterns 

The pattern of a sequence $x^n$ will be denoted by $\Psi(x^n)$. Many different sequences over the
same alphabet (and over different alphabets) have the same pattern. For example, for the
sequences $x^n =$ "lossless", $x^n =$ "sellsoll", $x^n =$ "12331433", and $x^n =$ "76887288", the pattern is
$\Psi(x^n) =$ "12331433". Therefore, for given $\Sigma$ and $\boldsymbol{\theta}$, the probability of a pattern induced by an i.i.d.
underlying probability is given by

$$P_{\boldsymbol{\theta}}[\Psi(x^n)] = \sum_{y^n \in \Sigma^n :\, \Psi(y^n) = \Psi(x^n)} P_{\boldsymbol{\theta}}(y^n). \tag{9}$$

We note that the probability in (9) is dominated by some of the sequences, where others only
contribute negligibly. This fact is used to derive an upper bound in Section 7. The per sequence
(block) pattern entropy of order $n$ of a source $\boldsymbol{\theta}$ is thus defined as

$$H_{\boldsymbol{\theta}}[\Psi(X^n)] = -\sum_{\Psi(x^n)} P_{\boldsymbol{\theta}}[\Psi(x^n)] \log P_{\boldsymbol{\theta}}[\Psi(x^n)]. \tag{10}$$
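
For very short sequences, the pattern probability and the block pattern entropy can be evaluated by brute force, which gives a concrete check of the definitions. The sketch below is only illustrative: it enumerates all $k^n$ sequences, so it is usable only for tiny $n$ and $k$, and the function names and the parameter vector `theta` are arbitrary choices, not taken from the paper.

```python
import math
from itertools import product


def pattern(seq):
    """Pattern of seq as a tuple of 1-based indices in order of first occurrence."""
    index = {}
    return tuple(index.setdefault(s, len(index) + 1) for s in seq)


def pattern_distribution(theta, n):
    """Map each pattern of length n to its probability under the i.i.d. source theta."""
    dist = {}
    for seq in product(range(len(theta)), repeat=n):
        p = math.prod(theta[s] for s in seq)   # i.i.d. probability of this sequence
        key = pattern(seq)
        dist[key] = dist.get(key, 0.0) + p     # sum over sequences sharing the pattern
    return dist


def entropy(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


theta = (0.7, 0.2, 0.1)   # arbitrary example source with k = 3
n = 6
dist = pattern_distribution(theta, n)
print(entropy(dist.values()), n * entropy(theta))  # pattern entropy vs. i.i.d. entropy
```

Since the pattern is a function of the sequence, the block pattern entropy computed this way can never exceed the $n$th-order i.i.d. entropy.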

In order to define the redundancy function of patterns for a given code and a given source $\boldsymbol{\theta}$, we
need to realize that a vector $\boldsymbol{\theta}'$ that is a permutation of another vector $\boldsymbol{\theta}$ produces similar typical
patterns, and is, in fact, the same source in the pattern domain. Therefore, we can define the
notation $\boldsymbol{\psi}(\boldsymbol{\theta})$ as the permutation of $\boldsymbol{\theta}$ which is ordered in non-decreasing order of components,
i.e., $\psi_1(\boldsymbol{\theta}) \le \psi_2(\boldsymbol{\theta}) \le \ldots \le \psi_k(\boldsymbol{\theta})$. For example, if $\boldsymbol{\theta} = (0.7, 0.1, 0.2)$, then $\boldsymbol{\psi}(\boldsymbol{\theta}) = (0.1, 0.2, 0.7)$.
We can also, alternately, view a vector $\boldsymbol{\sigma}$ as a permutation vector of indices, and use $\theta(\sigma_i)$ to
denote the $i$th component of the vector $\boldsymbol{\theta}$ permuted according to $\boldsymbol{\sigma}$. For the example
above, if we define $\boldsymbol{\sigma} = (3, 1, 2)$, then $\boldsymbol{\theta}(\boldsymbol{\sigma}) = (0.2, 0.7, 0.1)$ and $\theta(\sigma_2) = \theta_1 = 0.7$. In most sections,
we will consider the original vector $\boldsymbol{\theta}$ to be already ordered non-decreasingly, and therefore the
identity permutation $\boldsymbol{\sigma} = (1, 2, \ldots, k)$ will give $\boldsymbol{\psi}(\boldsymbol{\theta}) = \boldsymbol{\theta} = \boldsymbol{\theta}(\boldsymbol{\sigma})$. All vectors $\boldsymbol{\psi}(\boldsymbol{\theta})$ for all $\boldsymbol{\theta} \in \Lambda_k$
will constitute the pattern space $\Psi(\Lambda_k)$, and similarly, we can define $\Psi(\Lambda)$ as the pattern space
induced by (or projected from) the class $\Lambda$.
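
The ordering and permutation notation can be checked on the numerical example above. This is a minimal sketch with hypothetical helper names `psi` and `permute`; indices in `sigma` are 1-based to match the text.

```python
def psi(theta):
    """Components of theta in non-decreasing order (the pattern-domain source)."""
    return tuple(sorted(theta))


def permute(theta, sigma):
    """Vector whose i-th component is theta[sigma_i], with 1-based indices in sigma."""
    return tuple(theta[s - 1] for s in sigma)


theta = (0.7, 0.1, 0.2)
sigma = (3, 1, 2)
print(psi(theta))                   # (0.1, 0.2, 0.7)
print(permute(theta, sigma))        # (0.2, 0.7, 0.1)
print(psi(permute(theta, sigma)))   # (0.1, 0.2, 0.7), same pattern-domain source
```

Every permutation of a parameter vector maps to the same ordered vector, which is why the pattern space is indexed by the ordered vectors only.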

The average pattern redundancy for coding patterns generated by a source $\boldsymbol{\theta}$ using a code that
assigns a representation of length $L[\Psi(x^n)]$ to the pattern of sequence $x^n$ is defined as

$$R_n[L, \boldsymbol{\psi}(\boldsymbol{\theta})] = \frac{1}{n} E_{\boldsymbol{\theta}} L[\Psi(X^n)] - \frac{1}{n} H_{\boldsymbol{\theta}}[\Psi(X^n)]. \tag{11}$$

Similarly to (2), we can define the individual pattern redundancy for a given code as

$$\hat{R}_n[L, \Psi(x^n)] = \frac{1}{n}\left\{L[\Psi(x^n)] + \max_{\boldsymbol{\theta}} \log P_{\boldsymbol{\theta}}[\Psi(x^n)]\right\}. \tag{12}$$

Note that the ML probability is now different from that for the simple i.i.d. case, because the ML
is taken over the pattern probability and not over the i.i.d. one.

Even in the simplest i.i.d. underlying case, it becomes very difficult to derive closed form
expressions beyond (9) on the probability of a pattern (except for very specific patterns). It will
therefore be useful to define quantities that relate a code length to the i.i.d. entropy in the average
case and to the i.i.d. ML probability in the individual case. We will refer to these quantities as the
modified redundancies. The modified redundancy will be studied in Section 8, as part of the study
of the description length of the proposed sequential schemes. The average modified redundancy for
a code $L(\cdot)$ that codes patterns of a source $\boldsymbol{\theta}$ is defined as

$$\tilde{R}_n[L, \boldsymbol{\psi}(\boldsymbol{\theta})] = \frac{1}{n} E_{\boldsymbol{\theta}} L[\Psi(X^n)] - H_{\boldsymbol{\theta}}[X]. \tag{13}$$

The individual pattern modified redundancy is defined as

$$\hat{\tilde{R}}_n[L, \Psi(x^n)] = \frac{1}{n}\left\{L[\Psi(x^n)] + \max_{\boldsymbol{\theta}} \log P_{\boldsymbol{\theta}}[x^n]\right\}. \tag{14}$$

We should note that unlike the regular redundancy, the modified redundancy does not actually
satisfy conditions that must be satisfied by redundancy functions. In particular, it can be negative
also in the average case. If this happens, it simply means that one can universally describe
patterns using shorter descriptions than the entropy of the underlying i.i.d. source. We will see this
phenomenon in Section 8 and in [31]. The modified redundancy thus becomes handy for bounding
the description length a code can assign to a pattern, i.e.,

$$E_{\boldsymbol{\theta}} L[\Psi(X^n)] = H_{\boldsymbol{\theta}}[\Psi(X^n)] + n R_n[L, \boldsymbol{\psi}(\boldsymbol{\theta})] = H_{\boldsymbol{\theta}}[X^n] + n \tilde{R}_n[L, \boldsymbol{\psi}(\boldsymbol{\theta})], \tag{15}$$

and we can use either equality to bound this description length.

Using the definition of the average pattern redundancy in (11), we can replace $R_n(L, \boldsymbol{\theta})$ by
$R_n[L, \boldsymbol{\psi}(\boldsymbol{\theta})]$ in (3) to define the average minimax pattern redundancy $R_n^+[\Psi(\Lambda_k)]$. Similarly, we
can define the average maximin pattern redundancy $R_n^-[\Psi(\Lambda_k)]$ by the same substitution in (6).
Taking the maximum of (12) on $x^n$ and the minimum on $L(\cdot)$, similarly to (4), we obtain the
individual minimax pattern redundancy $\hat{R}_n^+[\Psi(\Lambda_k)]$. Note that all these redundancies can also be
obtained w.r.t. the class of all i.i.d. sources $\Lambda$ regardless of the alphabet size. Naturally, $R_n^+[\Psi(\Lambda)]$,
$R_n^-[\Psi(\Lambda)]$, and $\hat{R}_n^+[\Psi(\Lambda)]$ will take the maximal redundancy value over all alphabet sizes $k$.



3 Technical Background 

3.1 Individual Pattern Redundancy 

To the best of our knowledge, universal compression of patterns was first introduced by Aberg,
Shtarkov and Smeets [1]. Aberg et al. addressed the compression problem of individual pattern
sequences. In particular, they used the individual sequence minimax approach developed by Shtarkov
[40] to design the best code in the individual minimax sense. For standard sequence compression,
this approach assigns to an $n$-symbol sequence $x^n$ a probability $Q(x^n)$ that equals its ML probability
normalized by the sum of the ML probabilities over all possible sequences, i.e.,

$$Q(x^n) = \frac{P_{ML}(x^n)}{\sum_{y^n} P_{ML}(y^n)}, \tag{16}$$

where $P_{ML}(x^n)$ is the ML probability of $x^n$, and the summation is over all possible sequences
$y^n$ of length $n$. This approach guarantees (under negligible integer length constraints) individual
redundancy of

$$\hat{R}_n(Q, x^n) = \frac{1}{n} \log \frac{P_{ML}(x^n)}{Q(x^n)} = \frac{1}{n} \log \left\{\sum_{y^n} P_{ML}(y^n)\right\} \tag{17}$$

for every sequence $x^n$. Equation (17) is true in particular for the worst sequence $x^n$ for which this
redundancy is the minimal attainable. Therefore, this approach achieves the minimax redundancy.
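
The normalized ML assignment can be checked numerically for a very small case. The sketch below enumerates all $k^n$ sequences, so it is only meant for tiny $n$ and $k$; the function names are hypothetical. It illustrates that the individual redundancy of this assignment is the same for every sequence.

```python
import math
from collections import Counter
from itertools import product


def ml_probability(seq):
    """i.i.d. ML probability of seq: product over symbols of (count/n)^count."""
    n = len(seq)
    return math.prod((c / n) ** c for c in Counter(seq).values())


def shtarkov_sum(k, n):
    """Sum of ML probabilities over all k^n sequences of length n."""
    return sum(ml_probability(seq) for seq in product(range(k), repeat=n))


def individual_redundancy(seq, k):
    """(1/n) * {L(x^n) + log2 P_ML(x^n)} for L = -log2 of the normalized ML probability."""
    n = len(seq)
    q = ml_probability(seq) / shtarkov_sum(k, n)
    return (-math.log2(q) + math.log2(ml_probability(seq))) / n


k, n = 2, 4
print(shtarkov_sum(k, n))                     # normalizing sum
print(individual_redundancy((0, 0, 0, 0), k))
print(individual_redundancy((0, 1, 0, 1), k))  # same value as for (0, 0, 0, 0)
```

Because the ML probability cancels, the redundancy depends only on the normalizing sum, which is what makes the assignment minimax optimal in the individual sense.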
The approach above was modified for patterns by modifying (16) to

$$Q[\Psi(x^n)] = \frac{P_{ML}[\Psi(x^n)]}{\sum_{\Psi(y^n)} P_{ML}[\Psi(y^n)]}, \tag{18}$$

where $P_{ML}[\Psi(x^n)]$ is the ML pattern probability for the pattern of the sequence $x^n$, and the
normalization factor is the sum of all ML probabilities for all possible patterns of sequences generated
by sources $\boldsymbol{\theta} \in \Lambda_k$. Restricting the derivation to $\Lambda_k$ (and not the wider i.i.d. class $\Lambda$), it was shown
in [1] that the normalizing sum is approximately lower bounded by 

where $\Gamma(\cdot)$ is the Gamma function. If further analysis steps are performed beyond those in [1], this
yields a lower bound on the individual minimax pattern redundancy for patterns with at most $k$
different alphabet symbols of

$$\hat{R}_n^+[\Psi(\Lambda_k)] \ge (1 - \varepsilon) \frac{k - 1}{2n} \log \frac{n}{k^3}, \tag{20}$$

where $\varepsilon > 0$ can be made arbitrarily small. This bound is, of course, useful only for $k = o(n^{1/3})$
and becomes negative for larger alphabet sizes. Based on this result and prior results in [41], Aberg
et al. also proposed a sequential scheme for coding patterns, for which they provided empirical
results. The computational requirements of this scheme appear to be rather demanding.

Major progress in the research of individual pattern compression has been recently obtained
by Jevtic, Orlitsky, Santhanam, and Zhang [13], [17]-[20]. The approach used in those papers was
similar to that in [1], based on Shtarkov's minimax results and on combinatoric techniques. These
papers considered the compression of patterns generated by any source from the whole class $\Lambda$,
independently of the alphabet size $k$, i.e., the maximum number of different indices in the pattern.
First, it was shown [13] that probability assignment of

$$Q[\Psi(x^n)] = \frac{P_{ML}(x^n)}{\sum_{\Psi(y^n) :\, \boldsymbol{\theta} \in \Psi(\Lambda)} P_{ML}(y^n)}, \tag{21}$$

where $P_{ML}(x^n)$ is the i.i.d. ML probability (not the pattern ML probability) but the summation
is only on all possible patterns, results in modified individual redundancy of

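$$\frac{1}{n} \log \left\{\sum_{\Psi(y^n) :\, \boldsymbol{\theta} \in \Psi(\Lambda)} P_{ML}(y^n)\right\}. \tag{22}$$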



This redundancy is obtained for every pattern of length $n$ independently of the number of indices
in the pattern, and is also the minimax modified individual pattern redundancy. Then, Orlitsky
et al. [17]-[20] demonstrated that this modified redundancy is, in fact, a lower bound on the actual
pattern redundancy. (Note that if the analysis in [1] is modified to the whole class $\Lambda$, one can
obtain the same bound.) Using integer partitioning of a sequence of length $n$, it was also shown
in [17]-[20] that there exist codes that achieve individual minimax pattern redundancy of at most
$\pi \sqrt{2/3}\, (\log e) / \sqrt{n}$. Summarizing all these results, it was shown that there exist codes for which

$$\frac{1.5 \log e}{n^{2/3}} + o\left(\frac{1}{n^{2/3}}\right) \le \hat{R}_n^+[\Psi(\Lambda)] \le \pi \sqrt{\frac{2}{3}}\, \frac{\log e}{\sqrt{n}}. \tag{23}$$

Finally, a computationally demanding high complexity sequential scheme was shown in [18]-[20] to
achieve the order of the upper bound in (23), as well as a low-complexity sequential scheme that
achieves minimax individual redundancy of $O(n^{-1/3})$.
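
The $\pi\sqrt{2/3}\,(\log e)/\sqrt{n}$ form of the upper bound mirrors the classical growth rate of the number $p(n)$ of integer partitions of $n$, for which $p(n) < e^{\pi\sqrt{2n/3}}$. The following sketch is offered only as a numerical illustration of that growth rate under this assumption, not as the derivation in [17]-[20]; the function name `partition_count` is hypothetical.

```python
import math


def partition_count(n):
    """Number of integer partitions of n, by dynamic programming over part sizes."""
    ways = [1] + [0] * n           # ways[t] = partitions of t using the parts allowed so far
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


for n in (10, 100, 1000):
    per_symbol_bits = math.log2(partition_count(n)) / n
    bound = math.pi * math.sqrt(2 / 3) * math.log2(math.e) / math.sqrt(n)
    print(n, per_symbol_bits, bound)   # log2 p(n) / n stays below the bound
```

Describing which partition of $n$ a pattern induces therefore costs $O(n^{-1/2})$ bits per symbol, the same order as the upper bound.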

3.2 Average Case - Background 

Unlike the prior results on compression of patterns, we focus on the average case problem in 
compression of patterns induced by sequences generated by i.i.d. sources. To derive lower and upper 
bounds, we will use techniques that are based on Davisson's [5] and Rissanen's [24] approaches, and 
their extension [9], [16]. In particular, the well established connection between universal coding 
redundancy and channel capacity will be used to obtain lower bounds on the average pattern 
redundancy. In [5], it was established that the maximin redundancy of a class $\Lambda_k$ is bounded
from below by (and asymptotically equal to) the normalized capacity of the "channel" defined by
the conditional probability $P_{\boldsymbol{\theta}}(x^n)$, i.e., the channel whose input is the parameter $\boldsymbol{\theta}$ and whose
output is the data sequence $x^n$. It was further established that the average minimax redundancy
is lower bounded by the maximin redundancy. Using Gallager's later result [10] that shows that
the minimax and maximin redundancies are essentially equivalent, this leads to the bound on both
minimax and maximin redundancies of

$$R_n^+(\Lambda_k) = R_n^-(\Lambda_k) \ge \sup_w \frac{1}{n} I_w(\Theta; X^n), \tag{24}$$

where $I_w(\Theta; X^n)$ is the mutual information induced by the joint measure $w(\boldsymbol{\theta}) \cdot P_{\boldsymbol{\theta}}(x^n)$. Using
(24), any lower bound on the capacity of the channel defined by $P_{\boldsymbol{\theta}}(x^n)$ can be used to bound the
minimax and maximin redundancies. In particular, one can pick a set $\Omega$ of $M$ points $\boldsymbol{\theta} \in \Lambda_k$. If
these points can be shown to be distinguishable by the sequence $X^n$, then $(\log M)/n$ can serve as
a lower bound on the normalized capacity of the respective channel, and thus on the minimax and
maximin redundancies. This lower bound is specifically implied by Fano's Inequality using the fact
that the error probability goes to 0 (see, e.g., [16]). Distinguishability in a set of points $\Omega$ is defined
(in a stronger sense than needed for the result above) as follows. Let $\boldsymbol{\theta} \in \Omega$ be a point that generates
the random sequence $X^n$. Let $\hat{\boldsymbol{\theta}} = f(X^n)$ be an estimator of $\boldsymbol{\theta}$ from $X^n$, and let $\hat{\boldsymbol{\theta}}_\Omega = g(\hat{\boldsymbol{\theta}})$ be
a point in $\Omega$ that is used to estimate $\boldsymbol{\theta}$ from the estimator $\hat{\boldsymbol{\theta}}$, where $\hat{\boldsymbol{\theta}}$ is not necessarily a point
in $\Omega$. Then, there exist functions $f(\cdot)$ and $g(\cdot)$, such that $P_{\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}_\Omega \ne \boldsymbol{\theta}) \to 0$ as $n \to \infty$, for every
$\boldsymbol{\theta} \in \Omega$. In words, there exists an estimator of $\boldsymbol{\theta}$ out of the points in $\Omega$, such that the probability
that a sequence that was generated by one point in the set would appear to have been generated
by a different point in the set vanishes with $n$.
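
The redundancy-capacity argument can be illustrated numerically for the simplest class, Bernoulli sources. The sketch below, which assumes a uniform prior on a hypothetical set of $M = 5$ well-separated parameters, computes $\frac{1}{n} I_w(\Theta; X^n)$ through the number of ones in the sequence, which is a sufficient statistic, so the mutual information with the full sequence is unchanged. The function names and the chosen points are illustrative only.

```python
import math


def binomial_pmf(n, j, p):
    """Probability of j ones in n i.i.d. Bernoulli(p) draws."""
    return math.comb(n, j) * p**j * (1 - p) ** (n - j)


def normalized_mutual_information(points, n):
    """(1/n) I_w(Theta; X^n) in bits for a uniform prior w on the Bernoulli parameters in points."""
    m = len(points)
    # Mixture distribution of the number of ones under the uniform prior.
    mix = [sum(binomial_pmf(n, j, p) for p in points) / m for j in range(n + 1)]
    info = 0.0
    for p in points:
        for j in range(n + 1):
            pj = binomial_pmf(n, j, p)
            if pj > 0:
                info += pj * math.log2(pj / mix[j]) / m
    return info / n


points = [0.1, 0.3, 0.5, 0.7, 0.9]   # M = 5 points, spaced far apart relative to 1/sqrt(n)
n = 400
print(n * normalized_mutual_information(points, n), math.log2(len(points)))
```

When the points are distinguishable, the mutual information is close to $\log M$, so $(\log M)/n$ is a lower bound on the normalized capacity up to a vanishing term.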

The approach described above will be adopted to patterns in order to derive the bound in
Section 5. In the patterns case, we will consider the set of sources $\boldsymbol{\psi}(\boldsymbol{\theta}) \in \Psi(\Omega)$, and the pattern
source estimator $\boldsymbol{\psi}(\hat{\boldsymbol{\theta}})$ will be defined as a function of the pattern, i.e., $\boldsymbol{\psi}(\hat{\boldsymbol{\theta}}) = f[\Psi(X^n)]$, since
the sequence itself is not observed. Then, the estimator $\hat{\boldsymbol{\theta}}_\Omega = g[\boldsymbol{\psi}(\hat{\boldsymbol{\theta}})]$ must be in the pattern
source space $\Psi(\Omega)$. Since the minimax and maximin average redundancies are essentially the same,
we will consider only the minimax one, and the results will apply to both.

Merhav and Feder [16] extended the concept of the redundancy-capacity and derived a strong
version of the redundancy-capacity theorem. They showed that if it is possible to partition the
class $\Lambda_k$ into disjoint sets of sources $\boldsymbol{\theta}$, each of at least $M$ points that are distinguishable by $X^n$,
then the redundancy is lower bounded by

$$R_n(L, \boldsymbol{\theta}) \ge (1 - \varepsilon) \frac{\log M}{n}, \tag{25}$$

for every code $L(\cdot)$, and almost every $\boldsymbol{\theta} \in \Lambda_k$, where $\varepsilon > 0$ is arbitrarily small. In order to be
able to use this result, one needs to make sure that the points in each set are uniformly distributed
within the set, and every point in $\Lambda_k$ is included in one set (see also [27]-[29]). Sometimes such an
assumption cannot be made unless a non-uniform prior is assumed within the class. In such cases
the result in (25) does not apply to most sources in the class, but to all sources in the class except a
subset whose probability under the prior assumed vanishes. The technique that will be presented in
Section 5 for patterns will suffer from this problem, and thus cannot be used to obtain a lower bound
on the redundancy of most sources. Therefore, a different technique that uses Merhav and Feder's
theorem will be applied in Section 6 to derive a lower bound for most sources. As in Section 5, the
ideas described in this paragraph for standard compression will be applied to patterns in a similar
manner to that described in the preceding paragraph.

Both versions of the redundancy-capacity theorem presented above can be used by taking grids
of points from the class $\Lambda_k$, and showing that the points in each grid are distinguishable. Then, the
normalized logarithm of the number of grid points gives a lower bound on the required redundancy.
For the minimax redundancy, one such grid is sufficient using the weak version of the theorem.
For the redundancy for most sources, we need to show how we shift the grid to cover the whole
class without violating the conditions of the strong version of the redundancy-capacity theorem,
where the points in each shift of the grid remain distinguishable. For standard compression with
fixed alphabet size $k$, a uniform grid with spacing of $n^{-0.5(1-\varepsilon)}$ for an arbitrarily small $\varepsilon > 0$
is sufficient for distinguishability. This yields the well known bound, for which the cost of each
unknown probability parameter is $0.5 \log n$ bits. Recently, we showed [30], [33] that in the case of
large alphabets, the simple grid used to achieve the fixed $k$ bound is not sufficient. In the minimax
case, a non-uniform grid with increasing spacing in each dimension was created, and resulted in
a cost of $0.5 \log(n/k)$ bits for each unknown probability parameter. The same cost with a smaller
second order term resulted for most sources using sphere packing [2] considerations to create a grid
(or lattice) of distinguishable points. (Note that this idea is in line with Rissanen's proof for a
parametric source with a finite number of parameters [24].) The ideas that led to these bounds will
be modified in Sections 5 and 6 for lower bounding the minimax and most sources redundancies of
patterns.

In [9], Feder and Merhav showed that there exist classes that consist of different subclasses,
each with different redundancy within itself. For example, a union of the subclasses $\Lambda_k$ over all
alphabet sizes $k$ constitutes the class $\Lambda$. If all the subclasses are coded as one class, the redundancy
adapts to the worst one among the subclasses even if the actual source is from a subclass within which
smaller redundancy can be obtained. However, in most simple cases, the cost of distinguishing between
subclasses is negligible w.r.t. the universal cost within each subclass. Hierarchical coding first
distinguishes between the different subclasses and then between sources within each subclass. For
example, if the class $\Lambda$ is considered, the encoder will first code the alphabet size $k$ and then perform
universal coding within the subclass $\Lambda_k$. Such an approach yields lower costs for coding sources in
many subclasses than the cost of coding the whole class. Therefore, unlike the results in [13], [17]-[20],
we will consider the subclass $\Psi(\Lambda_k)$ and analyze the pattern redundancy for each $k$. If $k$ is initially
unknown, $(1 + \varepsilon) \log k$ bits can be used to relay to the decoder the number of indices in the pattern using
Elias's [8] coding of the integers.
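
To make the cost of relaying $k$ concrete, here is a minimal Python sketch of two classical Elias integer codes. Which of Elias's codes the text has in mind is not stated here, so the choice of the delta code is an assumption; the function names are hypothetical. The delta code spends $\log k + O(\log \log k)$ bits, which is within $(1 + \varepsilon) \log k$ for large enough $k$.

```python
def elias_gamma(k):
    """Elias gamma code of a positive integer k, as a string of bits."""
    binary = bin(k)[2:]
    return "0" * (len(binary) - 1) + binary

def elias_delta(k):
    """Elias delta code: gamma code of the bit length, then k without its leading 1."""
    binary = bin(k)[2:]
    return elias_gamma(len(binary)) + binary[1:]

def elias_delta_decode(bits):
    """Decode a single Elias delta codeword given as a string of bits."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    length = int(bits[zeros:2 * zeros + 1], 2)  # bit length of k
    start = 2 * zeros + 1
    return int("1" + bits[start:start + length - 1], 2)

k = 2 ** 20
print(len(elias_gamma(k)), len(elias_delta(k)))  # gamma is about 2 log k, delta is shorter
```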

One technique that will be used in Section 7 to design a code for coding patterns will use
ideas as in Rissanen's quantization two-part code method [24]. This technique estimates the ML
parameters from the sequence $X^n$ and then quantizes them onto a grid of points. Then, only
the quantized version of the ML parameters is relayed to the decoder, and entropy coding is used
w.r.t. this version as if the quantized parameters are the true source parameters. The redundancy
of this code consists of the cost of relaying the quantized ML estimators and the cost caused by
the quantization of the ML parameters. The latter results from the deviation of the quantized
parameters from the actual parameters. Usually, the quantization cost can be made negligible by
tuning the grid spacing properly. Unlike Rissanen's approach, we will need to use a non-uniform
grid for the quantization, as in [30], [33], although, unlike these references, we will be concerned
with index probabilities for patterns and not the actual letter probabilities.



4 The Main Results 

The paper contains the following main results: 

• a lower bound on the maximin and minimax redundancy for universal coding of patterns, 

• a lower bound on the redundancy for most sources when coding patterns, 

• an upper bound on the redundancy of coding patterns, specifically for not very large alphabets 
where all alphabet letters are likely to occur in a sequence, 

• two sub-optimal sequential low-complexity methods for coding patterns with upper bounds 
on the displacements of their description lengths from the i.i.d. ML description length and 
also with implications to the pattern entropy. 

Each of the above results is studied in a separate subsequent section. 

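In particular, we show that the $n$th-order maximin and minimax average universal coding
redundancies for patterns induced by i.i.d. sources with alphabet size $k$ are lower bounded by

$$R_n^{\pm}[\Psi(\Lambda_k)] \ge \begin{cases}
\frac{k-1}{2n}\log\frac{n^{1-\varepsilon}}{k^3} + \frac{k-1}{2n}\log\frac{\pi e^3}{2} - O\left(\frac{\log k}{n}\right), & \text{for } k \le \left(\frac{\pi n^{1-\varepsilon}}{2}\right)^{1/3}, \\
\left(\frac{\pi}{2}\right)^{1/3} \cdot (1.5 \log e) \cdot n^{-(2+\varepsilon)/3} - O\left(\frac{\log n}{n}\right), & \text{for } k > \left(\frac{\pi n^{1-\varepsilon}}{2}\right)^{1/3}.
\end{cases} \qquad (26)$$

The $n$th-order average universal coding redundancy is lower bounded by

$$R_n[L, \psi(\theta)] \ge \begin{cases}
\frac{k-1}{2n}\log\frac{n^{1-\varepsilon}}{k^3} - \frac{k-1}{2n}\log\frac{8\pi}{e^3} - O\left(\frac{\log k}{n}\right), & \text{for } k \le \frac{1}{2}\left(\frac{n^{1-\varepsilon}}{\pi}\right)^{1/3}, \\
\frac{1.5 \log e}{2\pi^{1/3}} \cdot n^{-(2+\varepsilon)/3} - O\left(\frac{\log n}{n}\right), & \text{for } k > \frac{1}{2}\left(\frac{n^{1-\varepsilon}}{\pi}\right)^{1/3},
\end{cases} \qquad (27)$$

for every code $L(\cdot)$ and almost every i.i.d. source $\theta \in \Lambda_k$. Both lower bounds demonstrate that for
small $k$, each parameter costs at least $0.5 \log(n/k^3)$ bits. For larger alphabets, the cost is at least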
$O(n^{-(2+\varepsilon)/3})$ bits overall.

Next, it is shown that there exist codes with length function $L^*(\cdot)$ that achieve redundancy

(28)

for patterns induced by any i.i.d. source $\theta \in \Lambda_k$. Namely, for small $k$, each parameter costs at most
$0.5 \log(n/k^2)$ bits, and for large $k$, $O(\sqrt{n})$ overall.

Next, a linear (per sequence) complexity sequential method (with prior knowledge of $k$) is shown
with modified individual redundancy that satisfies

R n [Qk , « (*»)] < A log £ + (i| log e ) t _ _L log n + o (5) . (29)

for every pattern $\Psi(x^n)$ of a sequence $x^n$ with $k$ distinct indices and for every $k < n$. With increased
complexity, identical performance is also achieved without prior knowledge of $k$. However, a second
linear complexity scheme achieves similar asymptotic performance in $k$, with only a second order
penalty, without prior knowledge of $k$. Finally, the implications of these bounds on the pattern
entropy are noted, in particular, indicating that the pattern entropy must decrease from the i.i.d.
one if $k$ is larger than $cn^{1/3}$, for some constant $c$.

5 A Maximin and Minimax Lower Bound

In [30], [32]-[33], it was established that for a large known alphabet of size $k$, choosing a set $\Omega$
of $M$ sources $\theta$ whose $k - 1$ free components are placed only at points on a non-uniform grid
of increased spacing in each dimension yields a set of distinguishable sources if the grid spacing
is properly chosen. The $k - 1$ components of grid points take values only from the grid vector
$\tau = (\tau_1, \tau_2, \ldots, \tau_b, \ldots, \tau_B)$. The components of $\tau$ satisfy $\tau_1 < \tau_2 < \cdots < \tau_b < \cdots < \tau_B$, and
the spacing between every two consecutive components increases with $b$. The advantage of such a
grid is that it yields a tighter bound on the redundancy, as we can include more points in regions
of $\Lambda_k$ in which closer points are distinguishable, i.e., for small probability parameters. For coding
patterns, we can build a similar grid of sources. However, we need to verify distinguishability in the
pattern domain, as explained in Section 3. A valid grid point $\theta \in \Omega$ and a non-identity permutation
$\theta' = \theta(\sigma) \ne \theta$, $\theta' \in \Omega$, of $\theta$ will not be distinguishable in the pattern domain, as they are likely
to generate similar patterns. Hence, in order to build a grid of sources which are distinguishable
in the pattern domain, we can take the grid $\Omega$ for i.i.d. sources, but keep only one point for each
set of permutations of the same source vector $\theta = \psi(\theta)$ (which is ordered in non-decreasing order
of components). We then consider a new grid $\Psi(\Omega)$ that contains only points $\theta = \psi(\theta)$ in which
the components are ordered in non-decreasing order. In this section, we will show how such a grid
can be obtained, and then will use the weak version of the redundancy-capacity theorem to derive
a lower bound on the minimax redundancy using this grid. We start by stating the main result
that lower bounds the minimax pattern redundancy, and then present its proof$^1$.

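Theorem 1 Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Then, the $n$th-order maximin and
minimax average universal coding redundancies for patterns induced by i.i.d. sources with alphabet
size $k$ are lower bounded by

$$R_n^{\pm}[\Psi(\Lambda_k)] \ge \begin{cases}
\frac{k-1}{2n}\log\frac{n^{1-\varepsilon}}{k^3} + \frac{k-1}{2n}\log\frac{\pi e^3}{2} - O\left(\frac{\log k}{n}\right), & \text{for } k \le \left(\frac{\pi n^{1-\varepsilon}}{2}\right)^{1/3}, \\
\left(\frac{\pi}{2}\right)^{1/3} \cdot (1.5 \log e) \cdot n^{-(2+\varepsilon)/3} - O\left(\frac{\log n}{n}\right), & \text{for } k > \left(\frac{\pi n^{1-\varepsilon}}{2}\right)^{1/3}.
\end{cases} \qquad (30)$$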

Theorem 1 shows that as long as $k$ is small (of $o(n^{1/3})$), each index probability parameter
costs at least $0.5 \log(n/k^3)$ extra code bits. However, if the alphabet size is larger, a threshold
phenomenon occurs, and the redundancy is of $O(n^{-2/3})$ overall. Note that this result applies even
if $k > n$, because regardless of the actual alphabet size, the number of indices that will occur in a
pattern is upper bounded by $n$. The bound in the first region coincides with the individual minimax
bound obtained from [1], described in (20). The second region points to the same behavior as the
worst case bound in (23). The average lower bound naturally applies to the individual minimax
worst case redundancy, but not the other way around. Theorem 1 shows that we are unlikely to
gain much in the average case over the worst sequence, at least for the minimax redundancy. The
$n^{-\varepsilon/3}$ gap may indicate a true small gap between the individual worst case and the average worst
case, but may also be due to sub-optimal bounding.

The proof of Theorem 1 builds a non-uniform grid $\Omega$ of points as in the i.i.d. minimax case.
Then, the grid size is reduced by a factor of $k!$, eliminating all permutations of any grid point
except the ordered permutation $\psi(\theta)$, resulting in a new grid $\Psi(\Omega)$ in the induced patterns
space. This elimination is a worst case one, since sources for which there are identical components
$\theta_i = \theta_j$ for $j \ne i$ have fewer than $k!$ permutations in the original i.i.d. grid. The elimination of
more grid points than necessary becomes significant for $k = O(n^{1/3})$ or larger. For alphabets of
these sizes, most distinguishable grid points in the i.i.d. standard compression grid contain identical
components. Therefore, we reduce the bound on the grid size by a factor that is too large. This
results in a useless bound that is smaller than 1 on the number of grid points $M$ in the pattern
grid, and requires adaptation of the largest bound on $M$ as a function of $k$ to all large $k$'s.

$^1$The initial derivation of a related bound to that of (30) appears in [32], and was done subsequently to the
derivation of the individual sequence minimax lower bound in [20] (see, e.g., [17]). The bound in [32] was later
improved. A problem with the second region of both bounds (in [32] and the improved one) was pointed out by
Orlitsky and Santhanam in October 2003. Consequently, the improved bound and its proof were corrected, resulting
in the second region of (30).

A second issue that needs to be addressed in order to use the weak version of the redundancy-capacity
theorem is that of distinguishability of the grid points in the pattern domain, as described
in Section 3. Although the grid we will use is a subset of the distinguishable i.i.d. grid, we need
to have distinguishability in the pattern domain, i.e., if point $\theta$ generated the sequence $X^n$, the
pattern $\Psi(X^n)$ needs to appear as if it were generated by $\psi(\theta)$. If we observe the sequence $X^n$
and obtain the ML estimator $\hat\theta$ of $\theta$ in the i.i.d. domain, we may have sequences for which $\hat\theta_i > \hat\theta_j$
for $j > i$. For such sequences, the i.i.d. grid estimate $\hat\theta_\Omega$ may still be equal to $\theta$ in the i.i.d. domain.
In the pattern domain, however, if this happens, by observing $\Psi(X^n)$, $\hat\theta_i$ will appear to be the
estimate of $\theta_j$ and $\hat\theta_j$ of $\theta_i$. We thus need to show that, despite that, distinguishability is still
maintained, and thus by the restriction that the pattern-domain grid estimate $\hat\psi_\Omega$ satisfies
$\hat\psi_\Omega \in \Psi(\Omega)$, we will still have $\hat\psi_\Omega = \psi(\theta)$ for all cases in which $\hat\theta_\Omega = \theta$. This will
be done as the last step of the proof of the theorem. The proof of Theorem 1 follows and concludes
this section.

Proof of Theorem 1: Let $\Lambda_k$ be the class of i.i.d. sources with an alphabet of size $k$, and let
$\Psi(\Lambda_k)$ denote its induced pattern class. First, let us consider a non-uniform grid $\Omega$ of points in
$\Lambda_k$. Also, at this point, let us assume that $k \le n^{1-2\varepsilon}$. This assumption will be justified later on,
and then it will be shown how we can still obtain a bound for the redundancy over $\Psi(\Lambda_k)$ for
larger values of $k$. Let $\tau$ be a vector of grid components, such that the first $k - 1$ components
$\theta_i$, $i = 1, \ldots, k - 1$, of $\theta \in \Omega$ must satisfy $\theta_i \in \tau$. Let $\tau_b$ be the $b$th point in $\tau$, and define it as

$$\tau_b \triangleq \sum_{j=1}^{b} \frac{2\left(j - \frac{1}{2}\right)}{n^{1-\varepsilon}} = \frac{b^2}{n^{1-\varepsilon}}. \qquad (31)$$

Then, for the $b$th point in $\tau$,

$$b = \sqrt{\tau_b} \cdot \sqrt{n^{1-\varepsilon}}, \qquad (32)$$

and also, the spacing $\Delta(\tau_b)$ between points $\tau_b$ and $\tau_{b-1}$ satisfies

$$\Delta(\tau_b) = \tau_b - \tau_{b-1} = \frac{2\left(b - \frac{1}{2}\right)}{n^{1-\varepsilon}} = \frac{2\sqrt{\tau_b}}{\sqrt{n^{1-\varepsilon}}} - \frac{1}{n^{1-\varepsilon}} \ge \frac{\sqrt{\tau_b}}{\sqrt{n^{1-\varepsilon}}}, \qquad (33)$$

where the last inequality is obtained because $\tau_b \ge 1/n^{1-\varepsilon}$. From (33), we see that for large $b$ and
$\tau_b$ the spacing between grid points is the same spacing used to obtain the well known bounds for
compression of i.i.d. fixed size alphabet sources. However, for small probability parameters, we
obtain a denser grid. Figure 1 demonstrates this non-uniform grid.
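
The grid construction can be checked numerically. The sketch below is only an illustration under the assumption that the grid points are $\tau_b = b^2/n^{1-\varepsilon}$ with $\tau_0 = 0$, as implied by (32); the function names and the parameter values are hypothetical. It verifies that the spacing grows with $b$ and is never smaller than $\sqrt{\tau_b}/\sqrt{n^{1-\varepsilon}}$.

```python
import math

def grid(n, eps):
    """Grid points tau_b = b^2 / n^(1 - eps) for b = 1, 2, ... while tau_b <= 1."""
    n_eff = n ** (1.0 - eps)
    b_max = math.floor(math.sqrt(n_eff))
    return [b * b / n_eff for b in range(1, b_max + 1)]

def spacings(tau):
    """Delta(tau_b) = tau_b - tau_{b-1}, taking tau_0 = 0 (an assumption of this sketch)."""
    previous = [0.0] + tau[:-1]
    return [t - p for t, p in zip(tau, previous)]

n, eps = 10_000, 0.1
n_eff = n ** (1.0 - eps)
tau = grid(n, eps)
delta = spacings(tau)

# Spacing is small near zero and approaches the uniform-grid spacing for large tau_b.
print(delta[0], delta[-1], 1.0 / math.sqrt(n_eff))
```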



Figure 1: Non-uniform grid for a large alphabet



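Let us first lower bound the number of points in the standard i.i.d. grid. Let $\theta = (\theta_1, \theta_2, \ldots, \theta_{k-1})$
be a point on the grid $\Omega$. Let $b_i$ be the index of $\theta_i$ in $\tau$, i.e., $\theta_i = \tau_{b_i}$. Then, from (31)-(32),

$$\sum_{i=1}^{k-1} \theta_i = \sum_{i=1}^{k-1} \tau_{b_i} = \sum_{i=1}^{k-1} \frac{b_i^2}{n^{1-\varepsilon}}. \qquad (34)$$

Hence, there is a one-to-one mapping between a grid point $\theta$ and the index vector $b = (b_1, b_2, \ldots, b_{k-1})$
of positive integers. Since the components of $\theta$ are probabilities, we must have

$$\sum_{i=1}^{k-1} \theta_i \le 1. \qquad (35)$$

From (34) and (35), it follows that if

$$\sum_{i=1}^{k-1} b_i^2 \le n^{1-\varepsilon}, \qquad (36)$$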

$\theta$ must be a valid grid point. Hence, the total number of grid points is the number of nonnegative
integer component vectors $b$ satisfying (36). As shown in the next lemma, this number is lower
bounded by the volume of a $k - 1$ dimensional sphere with radius $\sqrt{n^{1-\varepsilon'}}$, $V_{k-1}\left(\sqrt{n^{1-\varepsilon'}}\right)$ (see
[2] for this volume), where $\varepsilon' > \varepsilon$ and $\varepsilon' - \varepsilon$ is fixed, divided by $2^{k-1}$ for obtaining only positive
components. Note that due to the integer length constraints on the components of $b$ we must use
the greater $\varepsilon'$, and we obtain a lower bound (i.e., we consider the volume of a smaller sphere in
order not to include integer vectors that are not in the sphere).

Lemma 5.1 For the standard i.i.d. case with $k \le n^{1-2\varepsilon}$, the number of grid points satisfying (35)
is lower bounded by

$$M_{\mathrm{i.i.d.}} \ge \frac{V_{k-1}\left(\sqrt{n^{1-\varepsilon'}}\right)}{2^{k-1}} = \frac{1}{2^{k-1}} \cdot \begin{cases}
\frac{\pi^{(k-1)/2} \cdot n^{(1-\varepsilon')(k-1)/2}}{[(k-1)/2]!}, & k \text{ odd}, \\
\frac{[(k-2)/2]! \cdot \pi^{(k-2)/2} \cdot 2^{k-1} \cdot n^{(1-\varepsilon')(k-1)/2}}{(k-1)!}, & k \text{ even}.
\end{cases} \qquad (37)$$

The proof of Lemma 5.1 is in Appendix A. Taking the logarithm of the bound in (37), and
approximating factorials by Stirling's approximation, we obtain

$$\log M_{\mathrm{i.i.d.}} \ge \frac{k-1}{2}\log\frac{n^{1-\varepsilon'}}{k} + \frac{k-1}{2}\log\frac{\pi e}{2} - \frac{1}{2}\log k - O(1). \qquad (39)$$
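
The role of the smaller radius in the counting argument can be seen in a toy case. The sketch below is an illustration only, with hypothetical function names and small integer squared radii standing in for $n^{1-\varepsilon}$ and $n^{1-\varepsilon'}$. It counts positive-integer vectors $b$ with $\sum_i b_i^2$ at most a given squared radius and compares the count with the positive-orthant share of the sphere volume at the same and at a smaller radius.

```python
import math
from itertools import product

def count_positive_lattice_points(dim, radius_sq):
    """Number of vectors of positive integers b with sum(b_i^2) <= radius_sq."""
    r = math.isqrt(radius_sq)
    return sum(
        1
        for b in product(range(1, r + 1), repeat=dim)
        if sum(x * x for x in b) <= radius_sq
    )

def orthant_ball_volume(dim, radius_sq):
    """Volume of the dim-dimensional ball with squared radius radius_sq, divided by 2^dim."""
    volume = math.pi ** (dim / 2) * radius_sq ** (dim / 2) / math.gamma(dim / 2 + 1)
    return volume / 2 ** dim

count = count_positive_lattice_points(2, 100)
print(count, orthant_ball_volume(2, 100), orthant_ball_volume(2, 64))
```

In this toy case the orthant volume at the same radius overestimates the count, while the volume at the smaller radius is a valid lower bound, which mirrors the use of $\varepsilon' > \varepsilon$.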

Now, let us consider only a portion $\Psi(\Omega)$ of the grid $\Omega$ for the grid of distinguishable patterns.
The grid $\Psi(\Omega)$ includes all points $\theta \in \Omega$ for which $\psi(\theta) = \theta$, i.e., only the permutation of any
point $\theta' \in \Omega$ for which the components are in non-decreasing order is included in $\Psi(\Omega)$. Note that
this condition applies to the $k$ dimensions of $\theta$ including the additional $k$th parameter $\theta_k$. (For this
matter, if $\theta_k$ does not take a point from $\tau$, the nearest neighboring points from $\tau$ will be considered
as its grid point value.) The transformation from the space $\Lambda_k$ to the space $\Psi(\Lambda_k)$, that contains
all the points in $\Psi(\Omega)$, is shown in Figure 2 for $k = 2$ and $k = 3$. In the second case, only a
projection of two components on a two dimensional space is shown.

Figure 2: Transformation from i.i.d. space $\Lambda_k$ to pattern space $\Psi(\Lambda_k)$ for $k = 2$ and $k = 3$

In order to lower bound the size $M$ of the grid $\Psi(\Omega)$, we need to take out from $\Omega$ any point
$\theta \in \Omega$ that is a non-identity permutation of $\psi(\theta)$. For each point in $\Psi(\Omega)$, there are at most $k!$
such permutations (although there may be fewer). Therefore, we can lower bound the logarithm of
$M$, using Stirling's approximation and (39), by

$$\log M \ge \log M_{\mathrm{i.i.d.}} - \log(k!) \ge \frac{k-1}{2}\log\frac{n^{1-\varepsilon'}}{k^3} + \frac{k-1}{2}\log\frac{\pi e^3}{2} - 2\log k - O(1). \qquad (40)$$

From (40), we note that there exists a constant $c$ such that if $k > cn^{(1-\varepsilon')/3}$ the bound above
becomes negative. The reason is that we eliminated many points from the grid more than once.
For example, the grid point $\theta = [\tau_1, \tau_1, \ldots, \tau_1]$ only appears once in the grid $\Omega$ but was reduced by
a factor of $k!$ to obtain the bound in (40). This problem is negligible for small $k$'s, because
such grid points make a negligible fraction of $\Omega$. However, for large $k$'s, almost all or all (for very
large $k$'s) grid points contain many components that are identical.

To achieve a more useful bound on the logarithm of the number of grid points for large alphabets,
we can find the value of $k$ for which the maximal lower bound is obtained from (40). Denote it
by $k_m$. Then, for $k > k_m$ (including $k > n$), we can fix the first $k - k_m$ components of $\theta$ at a
value of $o\left[1/\left(n^{1+\varepsilon}(k - k_m)\right)\right]$ for all points $\theta \in \Psi(\Omega)$, where the $i$th component, $i \le k - k_m$,
of all $\theta \in \Psi(\Omega)$ takes the same value, and any of these letters will appear in $x^n$ with probability
going to 0. The other $k_m$ components will take the values from a pattern grid for an alphabet of size
$k_m$. Note that now we can justify the assumption that $k_m \le n^{1-2\varepsilon}$, assumed earlier for computing
the number of grid points. In fact, $k_m$ is much smaller, as indicated earlier. However, by fixing
all other components of $\theta$ as described above, the bound for $k_m$ applies even for alphabets with
$k > n^{1-2\varepsilon}$. If we now show that all points are distinguishable in the grid for $k = k_m$, they will also
be distinguishable in the grid defined above for larger $k$. Therefore, the bound for $k_m$ will hold for
every larger $k$ as well.

The bound in (40) attains a maximum value for $k_m = (\pi/2)^{1/3} \cdot n^{(1-\varepsilon')/3} \approx 1.16 n^{(1-\varepsilon')/3}$.
Substituting $k_m$ in (40), normalizing by $n$, and replacing $\varepsilon'$ by $\varepsilon$, we obtain the second region
of the bound in (30). The first region of the bound is obtained by normalizing the bound in
(40) by $n$ and substituting $\varepsilon'$ by $\varepsilon$. To conclude the proof of Theorem 1, we only need to prove
distinguishability in the non-uniform pattern grid. By the weak version of the redundancy-capacity
theorem, if distinguishability is proved, then the bounds we have obtained lower bound the minimax
redundancy.
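
The location of the maximum can be checked numerically. The following sketch assumes that the leading terms of (40) are $\frac{k-1}{2}\log\frac{n^{1-\varepsilon'}}{k^3} + \frac{k-1}{2}\log\frac{\pi e^3}{2} - 2\log k$ with base-2 logarithms and drops the $O(1)$ term; the function names are hypothetical and `n_eff` stands for $n^{1-\varepsilon'}$.

```python
import math

def log_grid_size_bound(k, n_eff):
    """Leading terms of the bound on log M, in bits (O(1) term dropped)."""
    return (
        (k - 1) / 2 * (math.log2(n_eff / k ** 3) + math.log2(math.pi * math.e ** 3 / 2))
        - 2 * math.log2(k)
    )

def closed_form_k_m(n_eff):
    """k_m = (pi / 2)^(1/3) * n_eff^(1/3)."""
    return (math.pi / 2) ** (1 / 3) * n_eff ** (1 / 3)

n_eff = 10 ** 6
# Brute-force search over integer alphabet sizes.
k_star = max(range(2, 1000), key=lambda k: log_grid_size_bound(k, n_eff))
print(k_star, closed_form_k_m(n_eff))
```

For alphabet sizes well above $k_m$ the same expression turns negative, which is the reason the bound for $k_m$ is reused for all larger $k$.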

We will now show that distinguishability in the grid $\Psi(\Omega)$ is a direct result of the distinguishability
in the grid $\Omega$. Let the sequence $X^n$ be generated by the point $\theta = \psi(\theta) \in \Psi(\Omega)$. Let $\Psi(X^n)$
be the pattern of $X^n$. Consider the estimator $\psi(\hat\theta)$ of $\theta$ obtained as a function of $\Psi(X^n)$, and let
$\hat\psi_\Omega$ be the nearest point to $\psi(\hat\theta)$ on the grid $\Psi(\Omega)$. We will show that there is an estimator $\psi(\hat\theta)$
for which $P_\theta(\hat\psi_\Omega \ne \theta) \to 0$ as $n \to \infty$. By definition of $\Psi(\Omega)$, the components $\theta_i$; $1 \le i \le k$, of
$\theta$ are in non-decreasing order. Let $\hat\theta$ be the ML estimator of $\theta$ from $X^n$, and $\hat\theta_\Omega$ the closest point in
$\Omega$ to $\hat\theta$ that is used to estimate $\theta$ in the standard i.i.d. case. Let $\psi(\hat\theta)$ be the ordered permutation
of $\hat\theta$, that can be obtained directly from the pattern $\Psi(X^n)$. For every $\theta_i$, $i = 1, \ldots, k$, let $\tau_{b_i}$ be
the nearest point in $\tau$ to $\theta_i$ that is smaller than or equal to $\theta_i$. (Note that for $i < k$, $\tau_{b_i} = \theta_i$, and
only for $\theta_k$ it may be smaller than $\theta_k$.) Define the event $A_i$ as

$$A_i : \left|\hat\theta_i - \theta_i\right| \ge \frac{\Delta(\tau_{b_i})}{2}, \qquad (41)$$

i.e., the event in which the ML estimate of component $\theta_i$ is outside an interval of length $\Delta(\tau_{b_i})$
centered at $\theta_i$. (If an error occurs in estimating $\theta_i$ by $\hat\theta_{\Omega,i}$, this must be true because $\Delta(\tau_{b_i})/2$ is
at most half the distance between $\theta_i$ and its nearest neighbors.) Let event $\Psi(A_i)$ be defined as

$$\Psi(A_i) : \left|\psi_i(\hat\theta) - \theta_i\right| \ge \frac{\Delta(\tau_{b_i})}{2}, \qquad (42)$$

where $\psi_i(\hat\theta)$ denotes the $i$th ordered component of the ML estimate of $\theta$, where the components
are ordered in non-decreasing order. Define event $A = \bigcup_i A_i$ as the union of all events $A_i$ and
event $\Psi(A) = \bigcup_i \Psi(A_i)$ as the union of all events $\Psi(A_i)$. The probability that event $A$ occurs
when $X^n$ is generated by $\theta$ will be denoted by $P_\theta(A)$. In a similar manner, $P_\theta[\Psi(A)]$ will denote
the probability that $\Psi(A)$ occurs given $X^n$ is generated by $\theta$. By definition of $\Psi(A)$ and (42),
event $\Psi(A)$ implies that the ordered version $\psi(\hat\theta)$ of the ML estimator is outside the portion
in $\Psi(\Lambda_k)$ of the box with edges $\Delta(\tau_{b_i})$, for every $i$, centered at $\theta$. The following lemma, which is
proved in Appendix B, bounds $P_\theta(A)$:

Lemma 5.2 

Pg (A) < 2 ( lo s fe )+( lo g™)- c ™ £/2 _> 0, (43) 

where c is a constant. 

Note that $P_\theta(A) \ge P_\theta(\hat\theta_\Omega \ne \theta)$, but we require a bound on the larger probability in order to
apply it to the pattern space. In the standard i.i.d. case, there is thus a vanishing probability even
of estimating $\theta$ outside the defined box. The following lemma relates the probability of
event $\Psi(A)$ to that of event $A$.

Lemma 5.3

$$P_\theta[\Psi(A)] \le P_\theta(A). \qquad (44)$$

Proof: We show that $\bar{A} \Rightarrow [\overline{\Psi(A)} \cap B]$, where $\bar{A}$ is the complement of $A$, and event $B$ is
defined below. Therefore, $[\Psi(A) \cup \bar{B}] = \{[\Psi(A) \cap B] \cup \bar{B}\} \Rightarrow A$, and thus also $\Psi(A) \Rightarrow A$ and
$P_\theta[\Psi(A)] \le P_\theta(A)$. The proof consists of the following steps: First, let $\varphi \subseteq (\theta_1, \theta_2, \ldots, \theta_{k-1}, \tau_{b_k})$
be a subset of $\theta$ with $\theta_k$ replaced by the nearest smaller grid point. Let $\varphi$ consist only of distinct
(unequal) elements. Let the components of $\varphi$ be ordered in increasing order. We show that $\bar{A}$
implies that $\hat\varphi$ is also in increasing order for any choice of $\varphi$ as described above. The latter event
is denoted by $B$. Hence, the respective ordered components of $\psi(\hat\varphi)$ will not be permuted from
those of $\hat\varphi$. This means that if $\psi(\hat\varphi) \ne \hat\varphi$ (i.e., $\psi(\hat\varphi)$ is a non-identity permutation of $\hat\varphi$ and event
$\bar{B}$ occurs), $A$ must occur. Then, we show that for equal components of $\theta$, although the components
of $\hat\theta$ may not be ordered, if $\bar{A}$ is satisfied, then each of the ordered components of $\psi(\hat\theta)$ must
satisfy $\overline{\Psi(A_i)}$. Together with the first step, this means that given $\bar{A}$, at least $k - 2$ components
of $\psi(\hat\theta)$ must satisfy event $\overline{\Psi(A_i)}$. The only remaining components of $\psi(\hat\theta)$ consist of at most
$\psi_k(\hat\theta)$ and one more component $\psi_l(\hat\theta)$ which takes the value of $\hat\theta_k$ if $\hat\theta_k$ is not the maximal ML
component of $\hat\theta$. (Otherwise, the proof is complete.) For these two components, we show that
$\{[\Psi(A_l) \cup \Psi(A_k)] \cap B\} \Rightarrow A$, concluding the proof.

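First, assume that $\bar{A}$ occurs. Then, for all $i$; $1 \le i \le k$,

$$\left|\hat\theta_i - \theta_i\right| < \frac{\Delta(\tau_{b_i})}{2} \;\Rightarrow\; -\frac{\Delta(\tau_{b_i})}{2} < \hat\theta_i - \theta_i < \frac{\Delta(\tau_{b_i})}{2}. \qquad (45)$$

Let $\tau_{b_j} > \tau_{b_i}$. Note that by definition of $\theta$ as an ordered vector and of $\tau_{b_i}$, $\tau_{b_j} > \tau_{b_i}$ implies that
$\theta_j > \theta_i$ (and also that $j > i$). (The other direction is true for $j < k$.) Given $\bar{A}$, we thus have,

$$\hat\theta_j - \hat\theta_i = (\hat\theta_j - \theta_j) + (\theta_j - \theta_i) + (\theta_i - \hat\theta_i) > -\frac{\Delta(\tau_{b_j})}{2} + \Delta(\tau_{b_j}) - \frac{\Delta(\tau_{b_i})}{2} \ge 0, \qquad (46)$$

where the first inequality is obtained by applying the left hand side of inequality (45) to the first
two and the last two terms, respectively, and by applying the left hand side of (33) to the two
middle terms. The last inequality is from the monotonicity of $\Delta(\tau_b)$ in $b$. Hence, if $\tau_{b_j} > \tau_{b_i}$, then
$\bar{A}$ implies that we must also have $\hat\theta_j > \hat\theta_i$ (and event $B$ must occur). This means that if the ML
estimates of two letters separated by at least one grid spacing unit are within the boxes defined in
(45), then these ML estimates are still ordered in the same order as the original letters. Hence, the
only case where ML estimates of two different letters may not be in the original order of the letters
is when $\tau_{b_j} = \tau_{b_i}$ for $j > i$. For $j < k$, this implies also that $\theta_j = \theta_i$, and thus if $\psi_i(\hat\theta) = \hat\theta_j$ but
also (45) holds for $\hat\theta_j$, then,

$$\left|\psi_i(\hat\theta) - \theta_i\right| = \left|\hat\theta_j - \theta_i\right| = \left|\hat\theta_j - \theta_j\right| < \frac{\Delta(\tau_{b_j})}{2} = \frac{\Delta(\tau_{b_i})}{2}. \qquad (47)$$

Therefore, for all $i < k$, except for at most $i = k$ and one value $i = l < k$, for which $\tau_{b_l} = \tau_{b_k}$, if $\bar{A}$
occurs, also $\overline{\Psi(A_i)}$ occurs. This is because except for permutations with $\hat\theta_k$, the only permutations
violating the order of $\theta$ in the resulting $\hat\theta$ can occur between letters with equal probabilities in $\theta$.
From the last inequality, such permutations still result in occurrence of $\overline{\Psi(A_i)}$.

The only case in which $\theta_j > \theta_i$ does not necessarily imply $\hat\theta_j > \hat\theta_i$ is when $j = k$ and $\tau_{b_i} = \tau_{b_k}$.
Let us now consider this case when $\hat\theta_k$ is not the maximal component of $\hat\theta$. (If $\hat\theta_k$ is the maximal
component of $\hat\theta$, the order of the estimates in $\hat\theta$ is not violated beyond permutations of equal
components in $\theta$, and we are back in the previous cases, for which the lemma has already been
proved.) Let $\hat\theta_i$ be the maximum component of $\hat\theta$. Then, $\psi_k(\hat\theta) = \hat\theta_i$. Also, there exists $l$ for
which $\tau_{b_l} = \tau_{b_k}$, such that $\psi_l(\hat\theta) = \hat\theta_k$. We show that if either $\Psi(A_l)$ or $\Psi(A_k)$ occur together
with $B$, then either $A_i$ or $A_k$ must occur as well.

First, let $\Psi(A_k)$ occur, i.e.,

$$\left|\psi_k(\hat\theta) - \theta_k\right| = \left|\hat\theta_i - \theta_k\right| \ge \frac{\Delta(\tau_{b_k})}{2}. \qquad (48)$$

If $\hat\theta_i > \theta_k$,

$$\hat\theta_i - \theta_i = (\hat\theta_i - \theta_k) + (\theta_k - \theta_i) \ge \frac{\Delta(\tau_{b_k})}{2} = \frac{\Delta(\tau_{b_i})}{2}, \qquad (49)$$

where the inequality is by definition of this case and by the ordering of $\theta$. The last inequality
means that $A_i$ occurs. If $\hat\theta_i < \theta_k$,

$$\theta_k - \hat\theta_k = (\theta_k - \hat\theta_i) + (\hat\theta_i - \hat\theta_k) \ge \frac{\Delta(\tau_{b_k})}{2}, \qquad (50)$$

where the inequality is, again, by definition of the case, and by the assumption that $\hat\theta_i$ is the
maximum component of the ML estimate of $\theta$. This inequality implies that $A_k$ occurs. Now, let
$\Psi(A_l)$ occur for $l$ defined above. Then,

$$\left|\psi_l(\hat\theta) - \theta_l\right| = \left|\hat\theta_k - \theta_l\right| \ge \frac{\Delta(\tau_{b_l})}{2} = \frac{\Delta(\tau_{b_k})}{2}, \qquad (51)$$

where the equalities are since the occurrence of $B$ implies $\tau_{b_l} = \tau_{b_k} = \theta_l$. If $\hat\theta_k < \theta_l$,

$$\theta_k - \hat\theta_k = (\theta_k - \theta_l) + (\theta_l - \hat\theta_k) \ge \frac{\Delta(\tau_{b_l})}{2} = \frac{\Delta(\tau_{b_k})}{2}, \qquad (52)$$

where the inequality is obtained similarly to the previous cases. The last equality is from the
occurrence of $B$. This implies that $A_k$ occurs. If $\hat\theta_k > \theta_l$, in a similar manner,

$$\hat\theta_i - \theta_i = (\hat\theta_i - \hat\theta_k) + (\hat\theta_k - \theta_i) = (\hat\theta_i - \hat\theta_k) + (\hat\theta_k - \theta_l) \ge \frac{\Delta(\tau_{b_l})}{2} = \frac{\Delta(\tau_{b_i})}{2}, \qquad (53)$$

where the second and last equalities are because of the occurrence of $B$. This implies $A_i$ occurs,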
and concludes the proof of Lemma 5.3. □

The proof of Lemma 5.3 considered three different cases relating two components $\theta_i$
and $\theta_j$; $j > i$, of $\theta$. Figure 3 shows the projection of these three cases onto a two dimensional
subspace that contains only components $i$ and $j$. The dots represent grid points. A rectangular
box surrounding a dot contains all the ML estimator points that are in event $\bar{A}_i \cap \bar{A}_j$ if $(\theta_i, \theta_j)$ is on
the dot. The first two cases are in part (a) of the figure, and the last in part (b). In the first case,
the complete box is contained in $\Psi(\Lambda_k)$. This is the case in which $\theta_j > \theta_i$; $j < k$. The occurrence
of $\bar{A}$ implies event $B$, which means that the ML estimates in this case will remain in the original
ordering, i.e., estimating the components of $\psi(\hat\theta)$ out of $\Psi(X^n)$ will give the same estimates as
those obtained by estimating $\theta$ out of $X^n$. Note that $P_\theta[\Psi(A_i) \cup \Psi(A_j)] \le P_\theta(A_i \cup A_j)$, where
a possible decrease is because some un-typical sequences that have ML estimates $\hat\theta \notin \Psi(\Lambda_k)$ will
be projected into the same box around $\theta$ by estimating out of $\Psi(X^n)$ and will (insignificantly)
increase the probability of $\overline{\Psi(A_i) \cup \Psi(A_j)}$ from that of $\overline{A_i \cup A_j}$.

Figure 3: Decision regions in a two dimensional projection of the pattern grid $\Psi(\Omega)$

In the second and third cases, the box around $\theta$ contains a region that is in $\Lambda_k$ but outside
$\Psi(\Lambda_k)$, i.e., there exist sequences $x^n$ that can be generated by $\theta$ and result in an ML i.i.d. estimator
$\hat\theta$ of $\theta$ that is still within the box defined above, but is not properly ordered, and thus $\psi(\hat\theta) \ne \hat\theta$.
As shown in the proof of Lemma 5.3, this can only occur when $\theta_i = \theta_j$, as in the second case in
Figure 3, or when $j = k$, and $\tau_{b_i} = \tau_{b_k}$, as shown in the third case of Figure 3. As shown in the
proof of Lemma 5.3, both cases still result in $\bar{A} \Rightarrow \overline{\Psi(A)}$ when estimation is done according to
$\Psi(x^n)$. From Figure 3, we see that this is the case, because the re-ordering of $\hat\theta$ to generate $\psi(\hat\theta)$
means projection of components of $\hat\theta$ over the diagonal lines as shown for both cases in the figure.

To conclude the proof of Theorem 1, we need to consider the estimator $\hat\psi_\Omega \in \Psi(\Omega)$, which
estimates $\theta = \psi(\theta)$ by the point in $\Psi(\Omega)$ nearest to $\psi(\hat\theta)$. Based on Lemmas 5.2 and 5.3, we
show that the error probability for this estimator, which is solely based on the pattern $\Psi(X^n)$ of
$X^n$, vanishes with $n$. An error occurs if $\hat\psi_\Omega \ne \theta$. If this happens, event $\Psi(A)$ must happen, because
the distance between two adjacent grid points is not smaller than $2\Delta(\tau_{b_i})/2 = \Delta(\theta_i)$. (Note that
now we only need to estimate the first $k - 1$ components of $\theta$, since the last component $\theta_k$ is then
determined by the others.) Also, for the second region of the bound, no error is possible in the first
$k - k_m$ small parameters because they need not be estimated since they are equal for all points on
the grid $\Psi(\Omega)$, and the probability that any of these letters occurs vanishes. Hence,

$$P_\theta(\hat\psi_\Omega \ne \theta) \le P_\theta[\Psi(A)] \le P_\theta(A) \to 0. \qquad (54)$$

This concludes the proof of Theorem 1. □



6 A Lower Bound for Most Sources 

The analysis in Section 5 cannot be used to lower bound the average pattern redundancy for most
sources. This is because of the non-uniform grid. The strong version of the redundancy-capacity
theorem requires the sources in each set of $M$ sources to be uniformly distributed for the result in
(25) to hold. However, randomly choosing a non-uniform grid, generating a uniform distribution
of the sources in the grid, results in an overall non-uniform distribution of the sources in $\Psi(\Lambda_k)$,
because sources in the dense areas are more likely to be chosen. The redundancy-capacity theorem
can still be used, but the bound that is obtained will be a bound on the class, assuming the sources
are distributed with a non-uniform prior in the class $\Psi(\Lambda_k)$. Such a bound is not a bound for most
sources in the class in Rissanen's sense.

To derive a lower bound on the redundancy for most sources in the class $\Psi(\Lambda_k)$, a different
approach from that in Section 5 must, therefore, be used. Instead of a non-uniform grid, we show
that sources in the centers of disjoint spheres with radius $r = n^{-0.5(1-\varepsilon)}$ in the $(k-1)$-dimensional
pattern space are distinguishable, and count the number of spheres that can be packed in the space
$\Psi(\Lambda_k)$ (see [2] for information about the sphere packing problem). This sphere lattice can be shifted
to cover the whole class for different choices of $M$ points. Hence, the conditions of the strong version
of the redundancy-capacity theorem are then satisfied, and the normalized logarithm of the bound
on the number of spheres becomes the lower bound on the redundancy for most sources. (This
approach resembles Rissanen's pioneering work [24] for sources with a finite number of parameters.
However, here the asymptotics change due to the consideration of patterns and large alphabets.)

Since we no longer take advantage of the fact that sources that vary only in small parameters 
are still distinguishable, the size of the grid that is constructed reduces w.r.t. that of the minimax 
bound. This leads to a smaller lower bound on the redundancy for most sources, hinting that 
it may be possible to compress most sources in the class better than the worst sources. This 
is reasonable because many sources, with large k in particular, may generate very compressible 
pattern sequences, that may decrease the overall average redundancy. On the other hand, however, 
this redundancy reduction may also be due to looseness in the bounding techniques. The orders of 
the bounds obtained remain the same as those of the minimax bound, but for large alphabets, the 
coefficients become smaller. For small alphabets, the decrease in the bound is reflected in a smaller 
second order term. We proceed with Theorem 2, which lower bounds the redundancy of patterns
generated by most sources in the class, and conclude this section with its proof.

Theorem 2 Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Then, the $n$th-order average universal coding redundancy for coding patterns induced by i.i.d. sources with alphabet size $k$ is lower bounded by

for every code $L(\cdot)$ and almost every i.i.d. source $\theta \in \Lambda_k$, except for a set of sources $A_\varepsilon(n)$ whose volume goes to $0$ as $n \to \infty$.

Theorem 2 shows similar behavior of the redundancy for most sources to that shown by Theorem 1 for the minimax redundancy. For small $k$, each probability parameter, again, costs $0.5\log(n/k^3)$ extra code bits. For large $k$ (including $k > n$), we obtain a redundancy bound of $O(n^{-2/3})$, identical for all large values of $k$. The lower bound of Theorem 2 naturally is the strongest sense bound and applies also to the minimax average and individual redundancies. It is therefore smaller than the other two sets of bounds. While the first order term in the first region of (55) is equal to that of (30), the second order term here is negative and decreases the redundancy for most sources linearly with $k$, whereas the second order term of the first region in (30) is positive and increases the minimax redundancy linearly with $k$. In the second region of the bound in (55), the coefficient of the redundancy, which approximately equals 0.74, decreases w.r.t. that of the minimax redundancy in (30), which approximately equals 2.52.

The proof of Theorem 2 lower bounds the volume of the space $\Psi(\Lambda_k)$, and then uses sphere packing density results [2] to lower bound the number of spheres that can be packed in this volume. Then, it is shown that sources at centers of disjoint spheres with radius $r = n^{-0.5(1-\varepsilon)}$ are distinguishable also in the pattern space, i.e., by observing $\Psi(X^n)$. There are two methods that bound the volume of the space $\Psi(\Lambda_k)$. The first takes the volume of $\Lambda_k$, which by condition (35) must be $1/(k-1)!$, and divides it by $k!$ to extract all permutations of the same sources, resulting in a volume of $1/[(k-1)!\,k!]$. The other method directly computes the volume of $\Psi(\Lambda_k)$ from the conditions defining an ordered vector $\psi(\theta)$. Both methods obtain the same bound on the volume of $\Psi(\Lambda_k)$. We will, therefore, demonstrate only the second one. Since the second method is tight, it hints at the fact that, unlike the reduction of the grid $\Omega$ in Section 5 by a factor of $k!$ to form the grid $\Psi(\Omega)$, the reduction of the volume of $\Lambda_k$ by a factor of $k!$ to bound the volume of $\Psi(\Lambda_k)$ is tight. This is because of the difference in considering a grid and a continuous space. In the continuous space $\Lambda_k$, sources with several exactly identical components make a negligible portion of the space (as the probability of any single point is zero), whereas such sources are not negligible when we construct a grid as in Section 5.

Although the bounding of the volume of $\Psi(\Lambda_k)$ is tight, we still encounter a similar phenomenon to that in Section 5, where there exists a constant $c$, such that for every $k > c\,n^{(1-\varepsilon)/3}$, the bound becomes negative. This is due to another step in the bounding. In this analysis, we bound the number of spheres packed in $\Psi(\Lambda_k)$ by dividing the volume of $\Psi(\Lambda_k)$ by the volume of a single sphere and factoring in a packing density factor. However, as $k$ increases, most spheres contained in $\Psi(\Lambda_k)$ have only portions in the space, whereas big portions of those spheres are outside the space. Therefore, division by the complete volume of a sphere results in loose bounding of the number of sources that are still distinguishable in the space. We solve this problem in a manner that resembles the solution in Section 5. Let $k_m$ be the value of $k$ for which the bound is maximal. Then, for $k > k_m$, instead of considering the whole space $\Psi(\Lambda_k)$ and bounding the number of spheres in it, we bound the number of spheres in a slice of this space, in which there are only $k_m$ sufficiently large probability parameters, and all the other $k - k_m$ probability parameters sum to an insignificantly small total probability. This idea is best pictured if one considers packing spheres in a triangular based pyramid. The number of circles that can be packed on its base is larger than the number of circles that can be packed in any horizontal two-dimensional cut above the base. If the spheres are very large, we may not be able to pack any complete two-dimensional cuts of these spheres above the base. Since we are not interested in complete spheres in all dimensions, it is sufficient to consider the number of dimensions that will give the maximum number of sphere portions that are packed in the space. This number is a lower bound on the total number of sphere portions that can be packed in the space. Using only $k_m$ dimensions in the sphere packing analysis, we obtain the second region of the bound. Note that when we shift the sphere lattice to obtain a covering of the whole space, some center points that represent sources in the set will no longer be in the space, reducing $M$. However, the lower bound on $M$ obtained from the $k_m$-dimensional cut will not be affected, while at the same time the shifting allows the space covering condition of the strong version of the redundancy-capacity theorem to be satisfied.

As in Section 5, we also need to show that distinguishability in the i.i.d. space carries over to the pattern space. This is, in fact, easier than in the minimax case. All we need to show is that a point $\hat{\theta}$ in $\Lambda_k$ outside $\Psi(\Lambda_k)$ but still in a sphere that is centered inside $\Psi(\Lambda_k)$ projects onto a point $\psi(\hat{\theta})$ that is still in the same sphere. The point $\psi(\hat{\theta})$ is the one that will be obtained directly from $\Psi(X^n)$. Therefore, if the ML i.i.d. estimator $\hat{\theta}$ of $\theta$ based on $X^n$ is outside $\Psi(\Lambda_k)$ but still distinguishable in the i.i.d. space, its projection $\psi(\hat{\theta})$ into $\Psi(\Lambda_k)$, obtained from $\Psi(X^n)$, is still in the same sphere. This is shown by geometric considerations demonstrated as a series of exchanges that rearrange the components of $\hat{\theta}$ into $\psi(\hat{\theta})$ by exchanging a pair in each step. We conclude this section with the proof of Theorem 2.

Proof of Theorem 2: We begin with bounding the volume of the $(k-1)$-dimensional space $\Psi(\Lambda_k)$. Only ordered vectors $\theta$ for which $\theta_1 \le \theta_2 \le \cdots \le \theta_{k-1}$ are contained in $\Psi(\Lambda_k)$. This can be used to set constraints on a $(k-1)$-dimensional integral that bounds the volume of $\Psi(\Lambda_k)$. By condition (35),

$$1 \ge \sum_{i=1}^{k-1} \theta_i \ge (k-1)\,\theta_1 \;\Rightarrow\; \theta_1 \le \frac{1}{k-1}. \tag{56}$$

Similarly (and more generally),

$$1 - \sum_{j=1}^{i-1} \theta_j \ge \sum_{l=i}^{k-1} \theta_l \ge (k-i)\,\theta_i \;\Rightarrow\; \theta_i \le \frac{1 - \sum_{j=1}^{i-1} \theta_j}{k-i}. \tag{57}$$

Now, (57) gives upper limits on every component of $\theta$. The ordering condition of $\theta$ that is necessary for $\theta$ to be in $\Psi(\Lambda_k)$ gives lower limits on each component of $\theta$. Ordering is maintained by the above conditions except for the $k$th component $\theta_k$. Therefore, the volume obtained by a $(k-1)$-dimensional integral over 1 within all these limits needs to be reduced by a factor of $k$ to only take the $k$-dimensional permutations for which $\theta_k$ is not smaller than all other components of $\theta$. Including all the constraints, $V[\Psi(\Lambda_k)]$ is computed in the following equations:

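$$\begin{aligned}
V[\Psi(\Lambda_k)] &= \frac{1}{k} \int_0^{\frac{1}{k-1}} \int_{\theta_1}^{\frac{1-\theta_1}{k-2}} \int_{\theta_2}^{\frac{1-\theta_1-\theta_2}{k-3}} \cdots \int_{\theta_{k-2}}^{1-\sum_{j=1}^{k-2}\theta_j} d\theta_{k-1} \cdots d\theta_3\, d\theta_2\, d\theta_1 \\
&= \frac{1}{k} \int_0^{\frac{1}{k-1}} \frac{\left[1-(k-1)\theta_1\right]^{k-2}}{\left[(k-2)!\right]^2}\, d\theta_1 \\
&= -\frac{1}{k} \cdot \frac{\left[1-(k-1)\theta_1\right]^{k-1}}{\left[(k-1)!\right]^2} \Bigg|_0^{\frac{1}{k-1}} = \frac{1}{(k-1)!\,k!}.
\end{aligned} \tag{58}$$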
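As an informal numerical check that is not part of the original analysis, the closed form in (58) can be compared with a Monte Carlo estimate. The sketch below assumes that normalized i.i.d. exponential variables are uniform on the probability simplex (a standard fact), and the helper names are hypothetical. It estimates the fraction of the simplex whose points are already ordered and multiplies it by the simplex volume $1/(k-1)!$.

```python
import math
import random

def ordered_fraction(k, trials=200_000, seed=0):
    """Estimate the fraction of the simplex whose points are in non-decreasing order."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Normalised i.i.d. exponentials are uniform on the probability simplex.
        e = [rng.expovariate(1.0) for _ in range(k)]
        total = sum(e)
        theta = [x / total for x in e]
        if all(theta[i] <= theta[i + 1] for i in range(k - 1)):
            hits += 1
    return hits / trials

def pattern_space_volume(k):
    """Closed form of (58): 1 / ((k-1)! k!)."""
    return 1.0 / (math.factorial(k - 1) * math.factorial(k))

k = 3
simplex_volume = 1.0 / math.factorial(k - 1)      # volume of the (k-1)-dimensional simplex
estimate = simplex_volume * ordered_fraction(k)   # Monte Carlo volume of the ordered part
print(estimate, pattern_space_volume(k))
```

For $k = 3$ both numbers should be close to $1/12$, in line with the two methods of bounding the volume giving the same value.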

Now, consider packing of $(k-1)$-dimensional spheres with radius $r = n^{-0.5(1-\varepsilon)}$ in $\Psi(\Lambda_k)$ so that no spheres share the same point in the space. The ratio between the volume $V[\Psi(\Lambda_k)]$ of $\Psi(\Lambda_k)$ and the volume of one sphere $V_{k-1}(r)$ is

$$\rho \triangleq \frac{V[\Psi(\Lambda_k)]}{V_{k-1}(r)} = \frac{1}{(k-1)!\,k!\,V_{k-1}(r)} = \begin{cases} \dfrac{n^{\frac{1}{2}(1-\varepsilon)(k-1)}\,[(k-1)/2]!}{\pi^{(k-1)/2}\,(k-1)!\,k!}, & k \text{ odd}, \\[2ex] \dfrac{n^{\frac{1}{2}(1-\varepsilon)(k-1)}}{[(k-2)/2]!\;2^{k-1}\,\pi^{(k-2)/2}\,k!}, & k \text{ even}, \end{cases} \tag{59}$$

where we substituted the volume of $\Psi(\Lambda_k)$ from (58). However, the number of spheres that can be packed in $\Psi(\Lambda_k)$ is lower bounded by

$$M \ge \Delta \cdot \rho = \frac{\Delta}{(k-1)!\,k!\,V_{k-1}(r)}, \tag{60}$$

where the factor $\Delta = 2^{-(k-1)}$ is a lower bound on the sphere packing density, i.e., the fraction of the space that is actually occupied by spheres (see [2]). Now, let us choose a grid that contains the sources at the centers of all the $M$ spheres packed in $\Psi(\Lambda_k)$. We can lower bound the number of sources in one such grid by using (59)-(60). Taking the logarithm of the bound in (60) and using Stirling's formula to bound factorials, we obtain the bound

$$\log M \ge (1-\varepsilon)\frac{k-1}{2}\log n - \frac{k-1}{2}\log k^3 - \frac{k-1}{2}\log\frac{8\pi}{e^3} - \frac{3}{2}\log k + \frac{1}{2}\log\frac{e^3}{4\pi} - O\!\left(\frac{1}{k}\right). \tag{61}$$

As long as the lower bound on $M$ is large, we can (cyclically) shift the whole grid to allow different choices of grids in $\Psi(\Lambda_k)$ to cover the whole space, and satisfy the conditions of the strong version of the redundancy-capacity theorem. All random shifts of the original grid will form a covering of $\Psi(\Lambda_k)$, and can be designed so that uniform distribution is preserved for choosing a point $\theta \in \Psi(\Lambda_k)$ over the whole class and also within every set of $M$ points that is chosen. Hence, in this case we can use the normalized logarithm of the number $M$ of points on this random grid as a lower bound on the redundancy for most sources if all sources within any shift of the grid are distinguishable by the observed random sequence. This yields the first region of the bound in (55). However, observing (61), as in the minimax case, the bound becomes negative and useless for large $k$. As in Section 5, we solve this problem by fixing the bound at its maximum value as a function of $k$. Assume this value is attained at $k = k_m$. Then, for every $k > k_m$, we will obtain the same bound, resulting in the second region in (55). By straightforward differentiation it can be shown that the bound in (61) attains its maximum value for $k_m = 0.5\,(n^{1-\varepsilon}/\pi)^{1/3}$. Substituting this value of $k_m$ in (61), normalizing by $n$, we obtain the bound of the second region of (55).
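The maximization step can be illustrated numerically. The following sketch is not part of the original proof: it evaluates the logarithm of the bound in (60) exactly through log-gamma functions, compares it with the leading terms of (61) (assuming the lower-order constants are $-\frac{3}{2}\log k + \frac{1}{2}\log\frac{e^3}{4\pi}$, which is what Stirling's formula gives), and locates the maximizing $k$ by exhaustive search. The function names and the values of $n$ and $\varepsilon$ are illustrative assumptions.

```python
import math

def log2_M_exact(n, k, eps):
    """log2 of the sphere-count bound Delta * rho from (59)-(60)."""
    d = k - 1                                   # dimension of the pattern space
    r = n ** (-0.5 * (1.0 - eps))               # sphere radius
    # ln of the d-dimensional ball volume: pi^(d/2) r^d / Gamma(d/2 + 1)
    ln_ball = 0.5 * d * math.log(math.pi) + d * math.log(r) - math.lgamma(0.5 * d + 1.0)
    ln_space = -(math.lgamma(k) + math.lgamma(k + 1))   # ln of 1 / ((k-1)! k!)
    ln_density = -d * math.log(2.0)             # Delta = 2^-(k-1)
    return (ln_density + ln_space - ln_ball) / math.log(2.0)

def log2_M_stirling(n, k, eps):
    """Leading terms of (61), without the O(1/k) remainder."""
    h = 0.5 * (k - 1)
    return ((1.0 - eps) * h * math.log2(n) - h * math.log2(k ** 3)
            - h * math.log2(8.0 * math.pi / math.e ** 3)
            - 1.5 * math.log2(k) + 0.5 * math.log2(math.e ** 3 / (4.0 * math.pi)))

n, eps = 10 ** 6, 0.01
k_m = 0.5 * (n ** (1.0 - eps) / math.pi) ** (1.0 / 3.0)        # closed-form maximiser
k_star = max(range(2, 200), key=lambda k: log2_M_exact(n, k, eps))
# Leading coefficient of the second region: (k_m / 2) * 3 log2(e) per n^((1-eps)/3).
coefficient = 1.5 * math.log2(math.e) / (2.0 * math.pi ** (1.0 / 3.0))
print(k_m, k_star, coefficient)
```

Under these assumptions the exhaustive maximizer lands next to $k_m$, the bound is negative for sufficiently large $k$, and the leading coefficient evaluates to roughly 0.74, the value quoted after Theorem 2.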

When $k_m$ is used to obtain the bound for a larger $k$, we still shift the complete grid to create a covering of the space $\Psi(\Lambda_k)$ in which each source is contained in one grid. Unlike the minimax case, here we cannot simply discard points in the grid with $k > k_m$ nonzero parameters. These must be included in the grid, and distinguishability between them and other points must be proven. However, we can lower bound the number of sources in the grid by the number of spheres in a $k_m$-dimensional cut of $\Psi(\Lambda_k)$ for which all the other (first) $k - k_m$ parameters are very small and insignificant. This analysis is valid also if $k > n$, and thus the bound in the second region is general, and applies also to such large alphabets.

Finally, to satisfy the covering of the whole space, we need to show that every source in $\Lambda_k$ is included in a grid. Demonstrating that only for the ordered permutation is not sufficient. This can be done by taking different grids for each permutation vector, i.e., each ordered source $\psi(\theta)$ will appear in $k!$ different grids through its permutations. (Since the probability of a single point is zero in a continuous space, sources for which identical components exist do not pose a problem.)

To conclude the proof of Theorem 2, we need to show distinguishability of the grids defined above in the pattern space. We show that this is a direct result of distinguishability of the respective grids in the i.i.d. space. First, we state a lemma showing distinguishability in the i.i.d. space, i.e., by observing $X^n$, and then we prove another lemma that implies that distinguishability in the i.i.d. space causes distinguishability in the pattern space on the reduced pattern grid, obtained by observing only $\Psi(X^n)$.

Lemma 6.1 Consider one choice of a random grid in the i.i.d. space $\Lambda_k$ as defined above. Let $\theta \in \Lambda_k$ be a point on this grid, and let the random sequence $X^n$ be generated by the conditional probability $P_\theta(X^n)$ (given $\theta$). Then, the probability that the ML estimator $\hat{\theta}$ of $\theta$ from the observed $X^n$ is outside the sphere of radius $1/\sqrt{n^{1-\varepsilon}}$ centered in $\theta$ vanishes with $n$,

$$P_\theta\left( \|\hat{\theta} - \theta\| > \frac{1}{\sqrt{n^{1-\varepsilon}}} \right) \to 0, \tag{62}$$

for every alphabet size $k$.

The proof of Lemma 6.1 is presented in Appendix C. The next lemma shows that the distance between two points, one in $\Psi(\Lambda_k)$ and the other in $\Lambda_k$, can only decrease if the latter is projected into $\Psi(\Lambda_k)$. This lemma is necessary, because the ordered ML estimator $\psi(\hat{\theta})$ obtained directly from $\Psi(X^n)$ simply performs this projection over the i.i.d. ML estimator $\hat{\theta}$. Hence, this lemma implies that the ordered estimator must be closer to the point estimated, which is in the pattern space.



Lemma 6.2 Let $\theta$ and $\theta'$ be two points in $\Lambda_k$, such that $\theta \in \Psi(\Lambda_k)$. Then,

$$\|\theta - \psi(\theta')\| \le \|\theta - \theta'\|. \tag{63}$$



Proof: Vector $\psi(\theta')$, which is ordered in non-decreasing order, can be obtained from $\theta'$ by a series of exchanges between two components $i_l$ and $j_l$, $i_l < j_l$, where each exchange must decrease the (index) distances of both components from their location in $\psi(\theta')$. Namely, let $\theta'^{(l)}$ denote the vector obtained after the $l$-th exchange. Then, $\theta'^{(l-1)}_{i_l} > \theta'^{(l-1)}_{j_l}$, and also $i_l \ge \iota$ and $j_l \le \rho$, where $\psi_\iota(\theta') = \theta'^{(l-1)}_{j_l}$ and $\psi_\rho(\theta') = \theta'^{(l-1)}_{i_l}$, i.e., the final destination of each of the components in the ordered vector is in the same direction as the exchange. For simplicity, we omit the index $l$ from $i_l$ and $j_l$ when it can be inferred from the context. We show that each exchange can only decrease the Euclidean distance to $\theta$. For notation simplicity, let $\varphi_i \triangleq \theta'^{(l)}_i = \theta'^{(l-1)}_j$ and $\varphi_j \triangleq \theta'^{(l)}_j = \theta'^{(l-1)}_i$. Thus $\varphi_j > \varphi_i$. The difference between the square of the Euclidean distance from before and after the exchange satisfies

$$\begin{aligned} \|\theta - \theta'^{(l-1)}\|^2 - \|\theta - \theta'^{(l)}\|^2 &= (\theta_i - \varphi_j)^2 + (\theta_j - \varphi_i)^2 - (\theta_i - \varphi_i)^2 - (\theta_j - \varphi_j)^2 \\ &= 2(\varphi_j - \varphi_i)(\theta_j - \theta_i) \ge 0, \end{aligned} \tag{64}$$

where the last inequality is obtained since $\varphi_j > \varphi_i$ and $\theta_j \ge \theta_i$ since $\theta \in \Psi(\Lambda_k)$. Figure 4 shows a two dimensional projection of components $i$ and $j$ of all vectors for one exchange as described above. It demonstrates the decrease in distance to $\theta$ resulting from the exchange.

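Now, using (64),

$$\begin{aligned} \|\theta - \theta'\|^2 - \|\theta - \psi(\theta')\|^2 &= \sum_l \left\{ \|\theta - \theta'^{(l-1)}\|^2 - \|\theta - \theta'^{(l)}\|^2 \right\} \\ &= 2 \sum_l \left( \varphi_{j_l} - \varphi_{i_l} \right)\left( \theta_{j_l} - \theta_{i_l} \right) \ge 0. \end{aligned} \tag{65}$$

Since all components of the sum are non-negative, the sum is also non-negative. This concludes the proof of Lemma 6.2. □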
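The statement of Lemma 6.2 is easy to probe numerically. The sketch below is only an illustration and not part of the proof: it draws random probability vectors (the sampling helper is a hypothetical convenience), sorts one of them to play the role of an ordered $\theta$, and checks that sorting the other vector never increases the Euclidean distance between the two.

```python
import math
import random

def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def random_source(k, rng):
    """A random probability vector with k components (hypothetical helper)."""
    e = [rng.expovariate(1.0) for _ in range(k)]
    s = sum(e)
    return [x / s for x in e]

def check_projection(k, trials, seed=0):
    """Return True if sorting theta' never increased its distance to an ordered theta."""
    rng = random.Random(seed)
    for _ in range(trials):
        theta = sorted(random_source(k, rng))      # ordered point, i.e., in the pattern space
        theta_prime = random_source(k, rng)        # arbitrary point of the i.i.d. space
        projected = sorted(theta_prime)            # psi(theta'): non-decreasing rearrangement
        if distance(theta, projected) > distance(theta, theta_prime) + 1e-12:
            return False
    return True

print(check_projection(5, 2000))
```

The small tolerance only guards against floating-point rounding; the inequality itself is exact.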



Figure 4: One exchange step in projection of a source $\theta' \in \Lambda_k$ onto the pattern space $\Psi(\Lambda_k)$.

From Lemma 6.2, if $\|\hat{\theta} - \theta\| < 1/\sqrt{n^{1-\varepsilon}}$, then also $\|\psi(\hat{\theta}) - \theta\| < 1/\sqrt{n^{1-\varepsilon}}$. Similarly to the proof of Theorem 1, now let $\hat{\Omega}$ be the point in the random pattern grid, denoted by $\Psi(\Omega)$, nearest to $\psi(\hat{\theta})$. Then, using Lemmas 6.1 and 6.2, the probability that a sequence generated by $\theta$ will appear by $\Psi(X^n)$ to have been generated by another source in the same grid is upper bounded, as $n \to \infty$, by

$$P_\theta\left( \|\psi(\hat{\theta}) - \theta\| > \frac{1}{\sqrt{n^{1-\varepsilon}}} \right) \le P_\theta\left( \|\hat{\theta} - \theta\| > \frac{1}{\sqrt{n^{1-\varepsilon}}} \right) \to 0. \tag{66}$$

The first bound is since not all points in $\Psi(\Omega)$ are contained in spheres. Hence, distinguishability is attained. This concludes the proof of Theorem 2. □



7 Upper Bounds 

We now show how to design codes that attain low redundancy for coding patterns induced by i.i.d. sequences. We propose a code with good performance for smaller alphabet sizes, namely, $k \le \sqrt{n^{1-\varepsilon}}$, for an arbitrarily small $\varepsilon > 0$, and combine it with the method in [20] to asymptotically achieve the better compression of the two for a specific pattern. The new code uses Rissanen's [24] two-part grid based coding approach combined with a non-uniform grid that resembles that in Section 5. For a given sequence with $\hat{k}$ distinct symbols, we find the best $\hat{k}$-dimensional pattern probability vector $\hat{\psi}(\theta)$, which is the vector that gives the $\hat{k}$th-order ML probability for the pattern of the sequence. Note that $\hat{\psi}(\theta)$ may be different from $\hat{\theta}$ and $\psi(\hat{\theta})$. (Furthermore, the actual ML estimate of a pattern may contain more letters than those actually observed. However, in analyzing this code, we constrain the analysis to the average case, in which our reference is the $\hat{k}$-dimensional pattern probability, and to the class in which it is unlikely that $\hat{k} < k$.) Then, $\hat{\psi}(\theta)$ is quantized to a grid. The quantized components are first coded, and then, the sequence is assigned a probability according to these quantized probability parameters. In [20], the number of all different types of patterns of length $n$ is shown to equal the number of unordered partitions of the integer $n$. Given the type, the pattern ML probability vector $\hat{\psi}(\theta)$ can be computed, as well as its ML probability, which is used to then encode the sequence using a number of bits that equals its negative logarithm. Hence, the redundancy is the logarithm of the number of types, as shown in the upper bound of (23). The combined code can compute both description lengths, and then choose between them, and use the one that requires fewer bits. One bit is needed to relay to the decoder which of the codes is used. We summarize the performance of the code combined of both codes in the next theorem.

Theorem 3 Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Then, there exist codes with length function $L^*(\cdot)$ that achieve redundancy

$$\begin{cases} \ldots, & \text{for } k \le \sqrt{n^{1-\varepsilon}} \text{ and } \theta \in \bar{\Lambda}_k, \\ \ldots, & \text{for } k > \sqrt{n^{1-\varepsilon}} \text{ or } \theta \notin \bar{\Lambda}_k, \end{cases} \tag{67}$$

for patterns induced by any i.i.d. source $\theta \in \Lambda_k$, with alphabet of size $k$.

The first region of Theorem 3 applies to the class $\bar{\Lambda}_k$, i.e., it is assumed that the probability that fewer than $k$ letters will be observed in $X^n$ is $o(k/n)$. If the probabilities of all letters are greater than $1/n^{1-\varepsilon}$, this condition is satisfied. Note that the proposed code should also achieve good performance even if fewer than $k$ letters are likely to be observed in $X^n$. However, further research is still needed to guarantee that the penalty does not increase in this case, and is still bounded as in the first region of (67). The bound of the first region of (67) also applies to the individual pattern redundancy under the assumption that the underlying alphabet contains no symbols other than those observed. A weaker upper bound, which is to first order twice the bound of the first region of (67), was subsequently derived in [21] for coding individual patterns with $k$ occurring indices as long as $k = o(n^{1/3})$. While the bound in [21] is larger (thus weaker) and applies only to smaller $k$, it is stronger in the sense that it applies to a wider class containing all sequences in which $k$ symbols occur, without restricting the pattern generating alphabet to contain only symbols observed in $X^n$. The bound in the second region of Theorem 3 applies to the class $\Lambda_k$.

The upper bounds in (67) show that we can design universal codes for patterns that require at most $0.5\log(n/k^2)$ bits for each unknown probability parameter, as long as $k$ is small enough, essentially of $O(\sqrt{n})$ or less. If $k$ is larger, we observe a similar phenomenon as that of the lower bounds, in which we achieve the same redundancy for every large $k$, which is of $O(n^{-1/2})$ bits per symbol overall. This performance is better than that attainable in standard i.i.d. compression. In particular, in the first region we gain $0.5\log k$ bits for each parameter, and the gain increases with $k$ in the second region. In Section 9, we discuss a different method that can be used to bound the redundancy in the second region. The ideas considered can be used (as in subsequent work [35]) to obtain stronger bounds in this region.
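The $O(n^{-1/2})$ behavior of the second region comes from counting pattern types, which correspond to unordered partitions of $n$. As an illustration that is not taken from the paper, the sketch below computes the pattern of a sequence, its type, and the exact number of partitions of $n$, and compares the resulting per-symbol cost with the classical Hardy-Ramanujan growth rate $\pi\sqrt{2n/3}$ (a standard result assumed here, not derived in the text). All function names are hypothetical.

```python
import math

def pattern(sequence):
    """Replace each symbol by the index (starting at 1) of its order of first occurrence."""
    index = {}
    return [index.setdefault(symbol, len(index) + 1) for symbol in sequence]

def pattern_type(sequence):
    """Occurrence counts of the pattern indices, sorted: an unordered partition of n."""
    counts = {}
    for i in pattern(sequence):
        counts[i] = counts.get(i, 0) + 1
    return sorted(counts.values(), reverse=True)

def partition_count(n):
    """Number of unordered partitions of n (standard dynamic programme)."""
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]

n = 100
bits_per_symbol = math.log2(partition_count(n)) / n
# Classical upper bound p(n) <= exp(pi * sqrt(2n/3)), expressed in bits per symbol.
growth_rate = math.pi * math.sqrt(2.0 * n / 3.0) * math.log2(math.e) / n
print(pattern("abracadabra"), pattern_type("abracadabra"), bits_per_symbol, growth_rate)
```

Since the logarithm of the number of partitions grows like $\sqrt{n}$, dividing by $n$ gives a per-symbol cost of order $n^{-1/2}$, regardless of the alphabet size.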

As indicated earlier, we observe gaps between the upper bounds and the lower bounds considered in the previous sections. In the first region, the lower bound is smaller by $0.5\log k$ bits for each parameter, whereas in the second region (as in the results in [17]-[21]), the lower bound is of $O(n^{-2/3})$ overall instead of $O(n^{-1/2})$. Naturally, the second region for the lower bounds starts with smaller $k$. Gaps between the upper and the lower bounds are still an open problem and will be discussed in Section 9 in somewhat more detail. This section is concluded with the proof of Theorem 3.

Proof of Theorem 3: To prove Theorem 3, we demonstrate and analyze the code that achieves the redundancy bound for the first region of (67). As mentioned earlier, a given pattern is encoded by this code as well as the code in [20], and the one with the smaller description length is then chosen. One bit is used to convey which code has been used (resulting in the additional $O(1/n)$ term of the second region). The rest of the proof is thus focused on the first region and bounding the performance of the new code. The proof for the second region is concluded using [20].

Using the code for the first region, we first need $(1+\varepsilon)\log\hat{k}$ bits to encode the number of occurring letters $\hat{k}$ with Elias's coding for the integers [8]. Let $\theta \in \bar{\Lambda}_k$ and $k \le \sqrt{n^{1-\varepsilon}}$. Let $\hat{\psi}(\theta) = (\hat{\psi}_1, \hat{\psi}_2, \ldots, \hat{\psi}_{\hat{k}})$ be the $\hat{k}$-dimensional probability vector that maximizes the probability of $\Psi(X^n)$ in (9) for $X^n$. Let $\hat{\theta}$ be the i.i.d. ML estimator of $\theta$ from $X^n$. Let $\tau = (\tau_1, \tau_2, \ldots, \tau_b, \ldots, \tau_B)$ be a grid of $B$ points whose $b$th component is defined in a similar manner to (31), where $-\varepsilon$ is replaced by $\varepsilon$, i.e.,

$$\tau_b \triangleq \sum_{j=1}^{b} \frac{2\left(j - \frac{1}{2}\right)}{n^{1+\varepsilon}} = \frac{b^2}{n^{1+\varepsilon}}. \tag{68}$$

Thus, there are

$$B = \sqrt{n^{1+\varepsilon}} \tag{69}$$

points in $\tau$. Let $\varphi = (\varphi_1, \varphi_2, \ldots, \varphi_{k-1}, \varphi_k)$ be a quantized version of $\hat{\psi}(\theta)$, for which each of the first $k-1$ components $\varphi_i$ takes one of the two nearest grid points surrounding $\hat{\psi}_i$, i.e., if $\hat{\psi}_i \in [\tau_b, \tau_{b+1}]$, $\varphi_i$ equals either $\tau_b$ or $\tau_{b+1}$. The point that is chosen for $\varphi_i$ between the two grid points is the one that minimizes the absolute value of the cumulative difference between the first $k-1$ components of $\hat{\psi}(\theta)$ and those of $\varphi$ such that the non-decreasing order of the components of $\varphi$ is retained. This ensures that the last largest component $\varphi_k$ of $\varphi$ is within the defined grid spacing around $\hat{\psi}_k$, even if it does not take a value in $\tau$.

The code first codes the first $k-1$ components of $\varphi$, and then computes $P_\varphi[\Psi(X^n)]$, and uses (up to integer length constraints) $-\log P_\varphi[\Psi(X^n)]$ bits to code the pattern. The average code length for $\theta \in \bar{\Lambda}_k$ and $k \le \sqrt{n^{1-\varepsilon}}$ is thus bounded (up to integer length constraints) by

$$E_\theta L^*[\Psi(X^n)] \le 1 + (1+\varepsilon)\log k + E_\theta\left\{ L^*_R[\hat{\psi}(\theta)] \right\} - E_\theta\left\{ \log P_\varphi[\Psi(X^n)] \right\}, \tag{70}$$

where $L^*_R[\hat{\psi}(\theta)]$ is the cost of representing the quantized version $\varphi$ of $\hat{\psi}(\theta)$. The first term of 1 is the cost of one bit distinguishing between the two codes. The second term is a bound on the cost of representing $\hat{k} \le k$. The last term is the cost of coding the pattern using the quantized ML estimates in $\varphi$. The inequality also holds since some patterns may be given a shorter representation by the code from [20]. Denoting an upper bound on the representation cost of an up to $k$-dimensional vector $\varphi$ by $L^*_{R,k}$, the average redundancy for $\theta \in \bar{\Lambda}_k$ and $k \le \sqrt{n^{1-\varepsilon}}$ is, therefore, upper bounded by



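$$\begin{aligned} nR_n[L^*, \psi(\theta)] &\le 1 + (1+\varepsilon)\log k + E_\theta\left\{ L^*_R[\hat{\psi}(\theta)] \right\} + E_\theta\left\{ \log\frac{P_\theta[\Psi(X^n)]}{P_\varphi[\Psi(X^n)]} \right\} \\ &\le 1 + (1+\varepsilon)\log k + L^*_{R,k} + P_\theta(\hat{k} < k)\, n\log k + E_\theta\left\{ \log\frac{P_{\hat{\psi}(\theta)}[\Psi(X^n)]}{P_\varphi[\Psi(X^n)]} \;\Big|\; \hat{k} = k \right\} \\ &= L^*_{R,k} + E_\theta\left\{ \log\frac{P_{\hat{\psi}(\theta)}[\Psi(X^n)]}{P_\varphi[\Psi(X^n)]} \;\Big|\; \hat{k} = k \right\} + o\left( k\log\frac{n}{k^2} \right). \end{aligned} \tag{71}$$

The second inequality is since at most $\log k$ bits are required to code every index, and also because the pattern probability w.r.t. the $\hat{k}$-dimensional ML estimate is not smaller than the probability w.r.t. the actual parameter $\theta$. The next equality is because of the assumption that $P_\theta(\hat{k} < k) = o(k/n)$, and since $o(\log k) = o(\log(n/k^2))$ by definition of the region.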

To complete the bound in the first region, we now need to bound the remaining first two terms of (71). These two costs are the cost of coding $\varphi$, and the cost of using the quantized version $\varphi$ of the $\hat{k}$-dimensional pattern ML probability estimator $\hat{\psi}(\theta)$ instead of using the actual $\hat{k}$-dimensional pattern ML probability estimator. For the remainder of the proof, we can now assume that $\hat{k} = k$ because for the first term, we will obtain a bound that increases with $\hat{k}$, and for the second term, we compute the expectation conditioned on this event. We next bound the two costs and show that the second is negligible w.r.t. the first in the first region of the bound. This together with (71) results in the upper bound for this region.

Instead of coding the first $k-1$ components $\varphi_i$ of $\varphi$, we can code their indices in $\tau$. Let $b(\varphi_i)$ be the index in $\tau$ of the grid point that equals $\varphi_i$. Since the vector $\varphi$ is ordered, we can use a differential code which uses

$$(1+\varepsilon')\log\left[ b(\varphi_i) - b(\varphi_{i-1}) + c \right]$$

bits, where $c$ is a constant, to represent the integer displacement to the index of $\varphi_i$ from that of $\varphi_{i-1}$ with Elias's code, where $\varepsilon' > 0$ is arbitrarily small, $b(\varphi_0) = 0$, and the constant $c$ is added to accommodate zero or small displacements. Hence, altogether, we will need

$$\begin{aligned} L^*_R[\hat{\psi}(\theta)] &= \sum_{i=1}^{k-1} (1+\varepsilon')\log\left[ b(\varphi_i) - b(\varphi_{i-1}) + c \right] \\ &\le (1+\varepsilon')(k-1)\log\frac{B + ck}{k-1} \\ &\le (1+\varepsilon_1)\frac{k-1}{2}\log\frac{n^{1+\varepsilon}}{k^2} \end{aligned} \tag{72}$$

bits to represent $\varphi$, where the first inequality is obtained by Jensen's inequality, and the second follows directly from (69) and the assumption that $k = o(\sqrt{n})$ by absorbing low-order terms in $\varepsilon_1$. Note that the last inequality in (72) holds only for $k = o(\sqrt{n})$. The bound of (72) is used to bound the cost of representing $\varphi$ in (71). (We note that in Section 9, we will demonstrate a method that yields representation cost for $\varphi$ which is fixed at $O(n^{(1+\varepsilon)/3})$ bits. For $k > n^{1/3}$, this cost is better than that in (72). However, the cost of quantizing the pattern ML estimator, which is shown next, will overwhelm this cost for large $k$.)
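The differential index code and the Jensen step in (72) can be illustrated with a small sketch. It is not the paper's encoder: it assumes the grid has the form $\tau_b = b^2/n^{1+\varepsilon}$ (consistent with $B = \sqrt{n^{1+\varepsilon}}$ in (69)), it quantizes by simple rounding in the index domain rather than by the cumulative-difference rule described above, and the function names, the example vector, and the values of $\varepsilon'$ and $c$ are illustrative choices.

```python
import math

def grid_index(p, n, eps):
    """Index b of a grid point tau_b = b^2 / n^(1+eps) near p (rounding in the index domain)."""
    return round(math.sqrt(p * n ** (1.0 + eps)))

def representation_cost(psi, n, eps, eps_prime=0.01, c=2):
    """Bits spent on the index displacements of the first k-1 quantised components."""
    indices = [grid_index(p, n, eps) for p in psi[:-1]]   # psi must be non-decreasing
    cost, previous = 0.0, 0                                # b(phi_0) = 0
    for b in indices:
        cost += (1.0 + eps_prime) * math.log2(b - previous + c)
        previous = b
    return cost

def jensen_bound(k, n, eps, eps_prime=0.01, c=2):
    """First upper bound in (72): (1 + eps') (k-1) log((B + c k) / (k-1))."""
    B = math.sqrt(n ** (1.0 + eps))
    return (1.0 + eps_prime) * (k - 1) * math.log2((B + c * k) / (k - 1))

n, eps = 10_000, 0.1
psi = [0.02, 0.05, 0.08, 0.15, 0.30, 0.40]   # an ordered probability vector, k = 6
cost = representation_cost(psi, n, eps)
bound = jensen_bound(len(psi), n, eps)
print(cost, bound)
```

Because the displacements sum to at most $B$, the concavity of the logarithm guarantees that the actual cost never exceeds the Jensen bound, whichever ordered vector is used.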

We now bound the second term of (71). The probability of $\Psi(x^n)$ can be expressed as in (9) by summing over all sequences that have the same pattern with a fixed parameter vector. On the other hand, we can express the same probability by fixing the actual sequence and summing over all permutations of the parameter vector

$$P_\theta[\Psi(x^n)] = \sum_\sigma P_{\theta(\sigma)}(x^n). \tag{73}$$
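The equivalence of the two expressions can be checked by brute force on a toy example. The sketch below is illustrative only and assumes, as in the case $\hat{k} = k$ treated in the proof, that every letter of the alphabet occurs in the sequence; the alphabet $\{0, 1, 2\}$, the parameter vector, and the function names are hypothetical.

```python
import itertools
import math

def pattern(sequence):
    """Indices of first occurrence order, starting at 1."""
    index = {}
    return tuple(index.setdefault(s, len(index) + 1) for s in sequence)

def iid_probability(theta, sequence):
    """Probability of a sequence of letter indices under the i.i.d. source theta."""
    return math.prod(theta[s] for s in sequence)

def pattern_prob_by_sequences(theta, x):
    """Sum P_theta(y) over all sequences y that share the pattern of x."""
    k, target = len(theta), pattern(x)
    return sum(iid_probability(theta, y)
               for y in itertools.product(range(k), repeat=len(x))
               if pattern(y) == target)

def pattern_prob_by_permutations(theta, x):
    """Sum P_{theta(sigma)}(x) over all permutations sigma of the parameter vector."""
    return sum(iid_probability(perm, x) for perm in itertools.permutations(theta))

theta = (0.2, 0.3, 0.5)
x = (0, 1, 0, 2)        # all k = 3 letters occur
print(pattern_prob_by_sequences(theta, x), pattern_prob_by_permutations(theta, x))
```

If fewer than $k$ letters occurred in the sequence, the sum over all $k!$ permutations would count each matching sequence $(k - \hat{k})!$ times, which is why the sketch restricts itself to the case where all letters occur.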

Now, to bound the cost of quantizing the pattern ML estimator reflected in the second term of (71), we consider the logarithm of the ratio between $P_{\hat{\psi}(\theta)}[\Psi(X^n)]$ and $P_\varphi[\Psi(X^n)]$. We can express each of the two probabilities using (73). Then, we discard permutations of $\hat{\psi}(\theta)$ that give negligible probability for $X^n$ (and their respective quantized versions) from each of the sums in the ratio. Next, we bound the ratio between the probability of $X^n$ given a non-negligible permutation of $\hat{\psi}(\theta)$ and that obtained by the quantized version of this permutation. We obtain the same bound for all these permutations. This bound can, in turn, be used to bound the ratio between $P_{\hat{\psi}(\theta)}[\Psi(X^n)]$ and $P_\varphi[\Psi(X^n)]$. To obtain the bound for all permutations, we need to bound the absolute differences between $\varphi_i$ and $\hat{\psi}_i$, and between $\hat{\theta}_i$ and $\varphi(\sigma_i)$, which is the $\sigma_i$th component of the permutation of $\varphi$ according to permutation vector $\sigma$. The first difference is a direct result of the definition of the components of $\tau$ in (68). The second difference is the reason we need to omit negligible permutations of $\hat{\psi}(\theta)$ from the analysis. If we do not omit such permutations, we will be unable to bound this difference. Lemma 7.1, which is presented next, demonstrates that if the distance of components of a permutation of $\hat{\psi}(\theta)$ from the respective non-permuted components of $\hat{\theta}$ is too large, then the contribution of the conditional probability of this permutation to the probability of the pattern of $X^n$ in (73) will be negligible. A corollary to the lemma (which is shown in Appendix E as part of the proof of Lemma 7.2) will give us a bound on the absolute difference between components of a non-negligible permuted version of $\hat{\psi}(\theta)$ and those of $\hat{\theta}$, which, in turn, will lead to a bound on the desired difference between $\hat{\theta}_i$ and $\varphi(\sigma_i)$.

We begin by showing that there are permutations of the pattern ML estimator that contribute 
negligibly to the pattern probability. The following lemma can be used to demonstrate that. The 
lemma is stated more generally. 



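Lemma 7.1 Let $n \to \infty$. Let $\hat{\theta}$ be the standard i.i.d. ML estimator with $k$ non-zero components of the probability vector that governs $X^n$. Let $\phi = (\phi_1, \phi_2, \ldots, \phi_k)$ be another $k$-dimensional probability vector. Define

$$\delta_i \triangleq \ldots, \qquad i = 1, 2, \ldots, k. \tag{74}$$

Assume that there exists a set $J$ of at least $j$ indices $i \in J$, $1 \le i \le k$, such that

$$|\delta_i| \ge \begin{cases} \ldots, & \text{if } \phi_i > 2\hat{\theta}_i, \\ \ldots, & \text{if } \phi_i \le 2\hat{\theta}_i. \end{cases} \tag{75}$$

Then, as $n \to \infty$,

$$\frac{k!\,P_\phi(X^n)}{P_{\hat{\theta}}(X^n)} \to 0. \tag{76}$$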

Lemma 7.1 shows that if there are too many components of a vector $\phi$ that are far from those of $\hat{\theta}$, then even if we multiply the probability of $X^n$ given $\phi$ by $k!$ it still remains negligible w.r.t. the ML probability of $X^n$. The lemma shows that this is true for large distance with few components, as well as smaller distance with more components. Lemma 7.1 is proved in Appendix D.

For the sake of simple notation, let $\hat{\psi} = \hat{\psi}(\hat{\theta})$ denote the pattern ML probability parameter
vector from this point on to the end of the proof of the redundancy of the code for the first region.
(This is a slight abuse of notation, but is much less tedious.) Now, $\hat{\psi}(\sigma)$ and $\hat{\varphi}(\sigma)$ are the
permutations of $\hat{\psi}$ and $\hat{\varphi}$, respectively, obtained by permutation vector $\sigma$. Define the set $A$ as the
set of all permutation vectors $\sigma$, for which $\varphi = \hat{\psi}(\sigma)$ satisfies the condition in Lemma 7.1 w.r.t. $\hat{\theta}$.
Note that Lemma 7.1 also implies that $\hat{\psi}$ cannot satisfy its conditions w.r.t. $\hat{\theta}$. Then, given that
for every $\hat{\psi}(\sigma) \notin A$, we obtain $P_{\hat{\psi}(\sigma)}(X^n) / P_{\hat{\varphi}(\sigma)}(X^n) \le a$, for some expression $a$, the normalized
contribution of the quantization of $\hat{\psi}$ to the redundancy for every $x^n$ with $\hat{k} = k \le \sqrt{n}^{1-\varepsilon}$ observed
symbols can be upper bounded by

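$$
\begin{aligned}
\frac{1}{n}\log\frac{P_{\hat{\psi}}[\Psi(x^n)]}{P_{\hat{\varphi}}[\Psi(x^n)]}
&= \frac{1}{n}\log\frac{\sum_{\sigma} P_{\hat{\psi}(\sigma)}(x^n)}{\sum_{\sigma} P_{\hat{\varphi}(\sigma)}(x^n)} \\
&\le \frac{1}{n}\log\frac{(1+\varepsilon_2)\sum_{\sigma:\hat{\psi}(\sigma)\notin A} P_{\hat{\psi}(\sigma)}(x^n)}{\sum_{\sigma:\hat{\psi}(\sigma)\notin A} P_{\hat{\varphi}(\sigma)}(x^n)} \\
&\le \frac{1}{n}\log\frac{(1+\varepsilon_2)\sum_{\sigma:\hat{\psi}(\sigma)\notin A} a\,P_{\hat{\varphi}(\sigma)}(x^n)}{\sum_{\sigma:\hat{\psi}(\sigma)\notin A} P_{\hat{\varphi}(\sigma)}(x^n)}
\le \frac{\varepsilon_2\log e}{n} + \frac{\log a}{n}. \qquad (77)
\end{aligned}
$$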

The first inequality is obtained from Lemma 7.1, using a fixed arbitrarily small $\varepsilon_2 > 0$, and also by
decreasing the denominator. The last inequality is obtained since $\ln(1 + x) \le x$ for every $x > -1$.
To complete the bound, we need to find $a$. This is done in the following lemma.

Lemma 7.2 Let $\hat{\psi}$ be the $k$-dimensional pattern ML estimator obtained from $X^n$ for $k \le \sqrt{n}^{1-\varepsilon}$,
let $\hat{\varphi}$ be its quantized version, and let $\sigma$ be a permutation vector such that $\hat{\psi}(\sigma) \notin A$. Then,

, p ip{a) ( x ' n ) . cklnk 

where c is a constant. 

The proof of Lemma 7.2 is in Appendix E. We can plug (78) in (77) for a particular $x^n$ to show
that

$$\frac{1}{n}\log\frac{P_{\hat{\psi}}[\Psi(x^n)]}{P_{\hat{\varphi}}[\Psi(x^n)]} \le \frac{\varepsilon_2\log e}{n} + \frac{ck\ln k}{n^{1+\varepsilon/4}} = o\left(\frac{k}{n}\right), \qquad (79)$$

and hence the quantization cost is negligible w.r.t. the cost of representing $\hat{\varphi}$ in (72). Plugging the
bounds of (72) and (79) in (71), absorbing all low order terms in the leading $\varepsilon$, normalizing by $n$,
we obtain the upper bound of the first region of (67), thus concluding the proof of Theorem 3. □
8 Low Complexity Sequential Schemes and Pattern Entropy



We now present two sub-optimal low-complexity sequential algorithms for compressing patterns.
We are interested in analyzing the performance for various alphabet sizes, and in the total description
length of a pattern, which can be obtained by adding the modified redundancy we obtain here
to the i.i.d. entropy as in (15). The results in this section provide bounds on the universal description
length for coding patterns. A notable corollary is that for sufficiently large alphabets,
the universal description length of patterns is smaller than the i.i.d. entropy. This points to an
interesting phenomenon: the pattern entropy must decrease from the i.i.d. one for sufficiently
large alphabets. Subsequent to the work reported in this paper, pattern entropy and entropy rate
have been extensively studied, first in [34], and later in [11]-[12], [22]-[23], [31], [38]-[39].

8.1 Known Alphabet Size and A Mixture Code 

Let us first assume that although the alphabet $\Sigma$ itself is unknown, its size $k$ is known. For
coding i.i.d. sequences, Krichevsky and Trofimov [15] demonstrated that the minimum description
length (MDL) for i.i.d. sequences [24], [30] can be sequentially achieved using sequential probability
assignment, which when combined with arithmetic coding [25] results in an optimal sequential code.
In particular, they defined the probability $Q_{KT}(x^n)$ which is sequentially assigned to the sequence
$x^n$ as

$$Q_{KT}(x^n) = \prod_{i=1}^{n} Q_{KT}(x_i \mid x^{i-1}), \qquad (80)$$

where $Q_{KT}(x_i \mid x^{i-1})$ is a conditional probability assigned to the $i$th symbol $x_i$, given the subsequence
of all the preceding symbols $x^{i-1}$. It is defined as

$$Q_{KT}(x_i \mid x^{i-1}) = \frac{n^{i-1}(x_i) + 1/2}{i - 1 + k/2}, \qquad (81)$$

where $n^{i-1}(x_i)$ is the number of occurrences of the symbol $x_i$ in the subsequence $x^{i-1}$.
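The sequential assignment of (80)-(81) can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the paper; the function name `kt_code_length_bits` is hypothetical, and the code returns the ideal code length $-\log_2 Q_{KT}(x^n)$ rather than driving an actual arithmetic coder.

```python
import math
from collections import Counter


def kt_code_length_bits(sequence, k):
    """Return -log2 Q_KT(x^n) for a sequence over an alphabet of known size k.

    Hypothetical helper: it accumulates the conditional probabilities of (81).
    """
    counts = Counter()
    total_bits = 0.0
    # With 0-based position i, exactly i symbols precede the current one,
    # so the denominator i - 1 + k/2 of (81) becomes i + k/2 here.
    for i, symbol in enumerate(sequence):
        prob = (counts[symbol] + 0.5) / (i + k / 2)
        total_bits -= math.log2(prob)
        counts[symbol] += 1
    return total_bits
```

For the two-symbol sequence `"ab"` with `k = 2`, the conditional probabilities are 1/2 and 1/4, so the ideal code length is 3 bits.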

We can adopt this approach for coding patterns if we know that $k$ symbols occur in the sequence.
If a letter (or index) has already occurred, we can still update the probability as in (81). However,
once a new symbol occurs, i.e., $x_i$ is not contained in the subsequence $x^{i-1}$, $\Psi(x_i)$ will be determined
as the next available index, regardless of the actual value of $x_i$. This means that the event that
$\Psi(x_i)$ will take a new value not in $\Psi(x^{i-1})$ should be assigned the sum of the probabilities of
all letters $u \in \Sigma$ that have not yet occurred. Hence, similarly to (80), $\Psi(x^n)$ will be assigned
probability

$$Q_k[\Psi(x^n)] = \prod_{i=1}^{n} Q_k[\Psi(x_i) \mid x^{i-1}], \qquad (82)$$

where

$$Q_k[\Psi(x_i) \mid x^{i-1}] = \begin{cases} \dfrac{n^{i-1}(x_i) + 1/2}{i - 1 + k/2}, & \text{if } n^{i-1}(x_i) > 0, \\[2ex] \dfrac{(k - c_{i-1})/2}{i - 1 + k/2}, & \text{otherwise}, \end{cases} \qquad (83)$$

where $c_{i-1}$ is the number of distinct letters that occurred in the subsequence $x^{i-1}$. Theorem 4
summarizes the performance of the probability assignment in (82)-(83).
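To make the pattern assignment concrete, the sketch below first maps a sequence to its pattern and then accumulates the conditional probabilities as read from (82)-(83). It is only an illustration under that reading; the names `pattern` and `pattern_code_length_bits` are hypothetical, and `k` is assumed to be at least the number of distinct indices in the pattern.

```python
import math


def pattern(sequence):
    """Map a sequence to its pattern: indices 1, 2, ... in order of first occurrence."""
    index_of = {}
    out = []
    for symbol in sequence:
        if symbol not in index_of:
            index_of[symbol] = len(index_of) + 1
        out.append(index_of[symbol])
    return out


def pattern_code_length_bits(pat, k):
    """Return -log2 Q_k[Psi(x^n)] for a pattern with at most k distinct indices."""
    counts = {}
    bits = 0.0
    for i, idx in enumerate(pat):  # i symbols have already been coded
        denom = i + k / 2
        if idx in counts:
            # Index seen before: same update as the i.i.d. assignment (81).
            prob = (counts[idx] + 0.5) / denom
        else:
            # New index: total mass of the k - c_{i-1} letters not yet seen.
            prob = ((k - len(counts)) / 2) / denom
        bits -= math.log2(prob)
        counts[idx] = counts.get(idx, 0) + 1
    return bits
```

For the pattern `[1, 2]` with `k = 2` this gives 2 bits, one bit (that is, $\log_2 2!$) less than the 3 bits the i.i.d. assignment of (81) spends on a two-symbol sequence with distinct letters.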

Theorem 4 Let $n \to \infty$. Then, the individual modified redundancy of the probability assignment
in (82)-(83) is upper bounded by

$$R_n[Q_k, \Psi(x^n)] \le \frac{k}{2n}\log\frac{n}{k^3} + \left(\frac{19}{12}\log e\right)\frac{k}{n} - \frac{1}{2n}\log n + O\left(\frac{k^2}{n^2}\right), \qquad (84)$$

for every pattern $\Psi(x^n)$ of a sequence $x^n$ with $k$ distinct indices and for every $k < n$.

The proof of Theorem 4 is purely technical, relying on Stirling's approximation, and is presented
in Appendix F. From the proof, we can see that the last term is $k^2(\log e)/(4n^2)$, which is always
negligible w.r.t. the sum of all other terms. We can also notice from the proof that if almost all
letters in $x^n$ (except $o(k)$) occur more than a fixed number of times, (84) reduces to

$$R_n[Q_k, \Psi(x^n)] \le \frac{k}{2n}\log\frac{n}{k^3} + (1.5\log e)\frac{k}{n} - \frac{1}{2n}\log n + O\left(\frac{k^2}{n^2}\right), \qquad (85)$$

for every $k$ (i.e., the second term is slightly smaller). However, if there are $O(k)$ letters that occur
only one time, we must include the term of at most $k(\log e)/12$, obtained from the upper bound
of Stirling's approximation. Worst sequence case bounds on the individual true redundancy can
be easily obtained from Theorem 4. If the alphabet size is limited to $k$, the pattern probability of
the worst sequence will be at most $k!$ times its i.i.d. ML probability. Hence, the redundancy will
increase by $\log(k!)$, yielding the same redundancy as that of the i.i.d. case of $0.5(k-1)\log(n/k)$
bits, which diminishes for $k = o(n)$.

The expression in (85) attains a maximum for $k = n^{1/3}$ (neglecting the last two terms). The
maximum of (85) meets the performance of the minimax code in (22). In fact, a minimax code
that does not distinguish between different $k$'s adopts the worst case performance of $k = n^{1/3}$ for
every value of $k$. For $k > e \cdot n^{1/3}$, $R_n[Q_k, \Psi(x^n)]$ in (85) becomes negative. (This is also true for
(84) for $k > e^{19/18} \cdot n^{1/3}$.) This means that the number of bits required to code the pattern is
smaller than the negative logarithm of the ML i.i.d. probability of $x^n$, and that the pattern entropy
is much smaller than that of i.i.d. sequences for large $k$'s. This result cannot be observed from the
lower and upper bounds of the previous sections because they refer to the true redundancy w.r.t.
the pattern entropy. Further study of pattern entropy [31] extensively characterized the behavior
of the pattern entropy for different alphabet sizes and arrangements of the letter probabilities in $\theta$.
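The location of the maximum and of the sign change can be checked numerically. The sketch below assumes that the first two terms of (85) are $\frac{k}{2n}\log\frac{n}{k^3} + (1.5\log e)\frac{k}{n}$ with logarithms to base 2; the function name `first_two_terms_85` is hypothetical.

```python
import math


def first_two_terms_85(n, k):
    """First two terms of (85), in bits per symbol (lower-order terms dropped)."""
    return (k / (2 * n)) * math.log2(n / k**3) + 1.5 * math.log2(math.e) * k / n


n = 10**6
k_max = n ** (1 / 3)            # claimed maximizer, about 100
k_zero = math.e * n ** (1 / 3)  # claimed sign change, about 272

value_at_max = first_two_terms_85(n, k_max)
value_at_zero = first_two_terms_85(n, k_zero)
```

Under these assumptions the two terms peak near $k = 100$ for $n = 10^6$ and cancel at $k = e \cdot n^{1/3} \approx 272$, beyond which they are negative.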

The main drawback of the code above is that it requires knowledge of $k$. A "semi-sequential" two
pass code that identifies $k$ during the first pass can be used to achieve nearly the same performance
with additional $(1 + \varepsilon)\log k$ bits to inform the decoder of $k$. Elias's coding of the integers [8] can
be used to first encode $k$, and then the scheme of (82)-(83) is used to code the pattern. To avoid
the use of a two pass code, one can perform a mixture over all possible values $j$ of $k$. This can be
done by assigning at every $i$, $1 < i \le n$,



Q[*(^]=^£^[*(*% (86) 

where Qj (x n )] is defined by 



J=2 



Qj [* (xi) | x 1 - 1 ] 



A 



if j > Cj_i and n % 1 (xj) > 0, 



n i - 1 (x i )+l/2 
i-l+j/2 ' 

{J ~. C ^})1 2 , if j > Ci_i and n*" 1 (a*) = 0, (87) 



i.e., as long as the number of distinct occurring letters does not exceed j — Qj [\P (x 1 )] is equal to 
Qj (x 1 )] . Otherwise, Qj (x*)] assigns equal probability to all existing indices and also to the 
innovation index. Then, Q [\P (x 1 )] is averaged over Qj [vP (x 1 )]. The assigned probability satisfies 
for the actual k, 

Q^(x n )}>lQk + i[^(x n )}, (88) 

where we must consider index k + 1 in case all k symbols occur first earlier than at time n. This 
leads to modified redundancy of 

4. IQ, * Ml < I H ^ + (| log.) k - + I ^ + O (g) , (89) 

where the third term diverges from (84) because of the mixing and the use of $k + 1$ instead of $k$,
for this linear per-symbol complexity scheme.






8.2 Unknown Alphabet Size 



The assignment described requires extra manipulations or complexity for an unknown k. In [41], a 
more generalized form of (81) was presented, in which 

* f n '" 1 (^)+' y ;f ( T .\ > n 

Qgkt (a* I x*" 1 ) ^ '-^^ ■ ^ > °» j (90) 

( (M-fi-OCi-i+Q-^+^-i) ' oth ™ se 
where ^ > is some constant, is some function of the subsequence x i_1 , and M is a bound 
on the maximum number of alphabet letters. This extension of (81) allows asymptotically optimal 
performance for coding i.i.d. sequences. This performance depends only on the actual number k of 
alphabet letters that occur, and not on the total alphabet size M. 

It turns out that with correct modification of (90), one can sequentially (with fixed per symbol
complexity) asymptotically (with $k \to \infty$) achieve the same performance of (84)-(85) for patterns.
Let us consider the code in which

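$$Q[\Psi(x_i) \mid x^{i-1}] = \begin{cases} \dfrac{n^{i-1}(x_i) + 1/2}{i - 1 + c_{i-1}/2 + (c_{i-1}+1)^{1-\varepsilon}/2}, & \text{if } n^{i-1}(x_i) > 0, \\[2ex] \dfrac{(c_{i-1}+1)^{1-\varepsilon}/2}{i - 1 + c_{i-1}/2 + (c_{i-1}+1)^{1-\varepsilon}/2}, & \text{otherwise}, \end{cases} \qquad (91)$$

where $\varepsilon > 0$ can be chosen arbitrarily small. Theorem 5 summarizes the performance of this code.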

Theorem 5 Let $n \to \infty$. Then, the individual modified redundancy of the probability assignment
in (91) is upper bounded by

^ r~ T / t.m , n / 19 \ .A; 1 

i^[Q,*(x")] < — log^+ --e (loge)--— logn + 



2n k 3 \12 J v b ; n 2n 

l^ l0S T + ^ + b)' (92 > 

for every pattern $\Psi(x^n)$ of a sequence $x^n$ with $k$ distinct indices and for every $k < n$.

The proof of Theorem 5, again, relies on Stirling's approximation. It is presented in Appendix G.
The bound in (92) is shown such that the first row contains the terms (up to $\varepsilon$) identical to the
upper bound in (84). The second row contains the additional terms that increase the bound due
to the reduced complexity. If $k \to \infty$ and $\varepsilon$ is arbitrarily small, the bound in (92) asymptotically
meets the modified redundancy upper bound of (84), even if $k$ goes to infinity at a slower rate
than $n$. However, for smaller $k$'s, the two terms in the bottom row increase the redundancy, and if
$k > n^{1/3}$, work against the dominant negative first term. In practice, $k$ may be too small, and the
gap between the redundancy of the first scheme in (84) and that of the second scheme in (92) will
be noticed. A mixture of the assigned probability of the new scheme and of those for known small
$k$'s in (82)-(83) can be used to achieve the performance of (84) for every $k$.

Under the assumption leading to (85), the worst case modified redundancy of the first scheme is
obtained when $k = n^{1/3}$, where the extra number of bits required beyond $-\log P_{ML}(x^n)$ is linear in
$k$. The additional terms of the redundancy of the second scheme shift the maximum redundancy to
a larger value of $k$, yielding larger redundancy. For example, if $k \to \infty$, a value of $\varepsilon = 0.1$ will attain
the worst $k$ (under the assumption leading to (85)) at $k \approx n^{0.5/(1.5-\varepsilon)} = n^{0.5/1.4} \approx n^{0.357}$, which is
larger than $n^{1/3}$. If $n$ is not as large, the maximal redundancy will be attained for finite $k$'s, and will
increase w.r.t. $k$. For example, if $n = 10^6$, and $\varepsilon = 0.1$, the worst case $k$ is slightly above $k = 400$,
which is approximately $n^{0.44}$. Figure 5 shows the unnormalized modified redundancy bounds (in
bits) of both schemes (using a second order term of $1.5(\log e)k/n$, and for the first scheme with a
known $k$), as well as the individual modified redundancies obtained using the two proposed schemes
for patterns of actual sequences $x^n$. The results are shown for $n = 10^6$, for alphabets of sizes $k = 2$
to $k = 1000$, and for $\varepsilon = 0.1$ in the second scheme. For the second scheme, the results are also
shown for the worst possible sequence, i.e., the one that is used to obtain the bound of Theorem 5,
in which all the $k$ letters occur in the first $k$ symbols of $x^n$. The figure shows that the bound
in (85) is tight. As expected, the first scheme performs better than the second. The bound of
(92) is loose because the proof of Theorem 5 makes a loose assumption in order to use Stirling's
bounds. The algorithm is thus much better than the bound in (92). Since the performance is for an
individual sequence, the simulation curve for the second scheme is rather noisy. The reason is that
the behavior varies depending on where in the sequence first occurrences are. Since each point is
for a different individual sequence, the locations of the first occurrences vary. Figure 5 also verifies
the worst values of k mentioned above.
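The exponents quoted in this paragraph are easy to reproduce. The short computation below is only a sanity check of the arithmetic; the variable names are illustrative.

```python
import math

n = 10**6
eps = 0.1

# Exponent of the asymptotic worst-case k, i.e. 0.5 / (1.5 - eps).
asymptotic_exponent = 0.5 / (1.5 - eps)

# Exponent a such that n**a equals 400, for comparison with the finite-n example.
exponent_of_400 = math.log(400) / math.log(n)

print(round(asymptotic_exponent, 3))  # 0.357
print(round(exponent_of_400, 3))      # 0.434
```

A value of $k$ slightly above 400 therefore corresponds to an exponent slightly above 0.434, in line with the quoted approximation.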

For a given sequence $x^n$, the schemes described in this section assign probability based on
only a single permutation of $\hat{\theta}$. However, a pattern probability can be expressed as a sum of all
permutations of its ML estimator. Naturally, if the probabilities of all $k!$ permutations are included
in the assigned probability, better redundancy can be obtained. Subsequent to the derivation
of the schemes described here, a class of computationally more demanding schemes that accounts
for many such permutations was obtained and described in [18]-[20]. Unlike those schemes, the
methods proposed here can more easily be integrated into efficient low-complexity implementations
of adaptive arithmetic coding (see, e.g., [26]).
Figure 5: Bounds and simulation results for the individual modified pattern redundancies of two
sequential schemes with $n = 10^6$ and $\varepsilon = 0.1$

8.3 Pattern and I.I.D. Entropies 

The description length in (84) can lead to an upper bound on the pattern entropy in terms of the
i.i.d. one. In particular, it follows from (84), that if for an arbitrarily small $\varepsilon > 0$, the probability
that at least $k'$ distinct symbols will occur in a sequence of length $n$ is at least $1 - \varepsilon$, then

$$\frac{1}{n} H_\theta[\Psi(X^n)] \le \begin{cases} H_\theta(X), & \text{if } k' < e^{19/18} \cdot n^{1/3}, \\[1ex] H_\theta(X) - (1-\varepsilon)\dfrac{k'}{2n}\log\dfrac{k'^3}{e^{19/6}\,n} + O\left(\dfrac{\log n}{n}\right), & \text{if } k' \ge e^{19/18} \cdot n^{1/3}. \end{cases} \qquad (93)$$

The last equation can be shown by bounding the entropy by the average description length of the
following probability assignment code. The code assigns the pattern $\Psi(x^n)$ a codeword of length
$-\log P_\theta[\Psi(x^n)] \le -\log P_\theta(x^n)$ bits if less than $k'$ indices occur in $\Psi(x^n)$. Otherwise, it uses the
code leading to (84). One negligible bit is required to distinguish between the two cases, and at
most $O(\log n)$ bits are needed to inform the decoder of the actual number of indices if greater than
or equal to $k'$. Hence, for a large $k'$, the pattern entropy is significantly smaller than the i.i.d. one.
In fact, from (93) and the bounds derived in this paper, we observe that not only does the pattern
entropy decrease significantly from the i.i.d. one, but also the true pattern redundancy becomes
negligible compared to this decrease. The extensive study of the pattern entropy has been the
subject of several subsequent works, first in [34], and later in [11]-[12], [22]-[23], [31], [38]-[39].
9 Discussion



This paper considered the average case of the problem of universal pattern compression. Both lower 
and upper bounds on the redundancies of codes for this problem were obtained. However, a gap still 
exists between the lower bounds and the attainable upper bounds. While we considered the average 
problem, other work [1], [13], [17]-[21] considered the individual sequence case, using different 
techniques based on combinatorics. Although the aim and the techniques in those independent 
works were different, similar qualitative results were obtained, and in particular the same gap was
shown to exist between the orders of magnitude of the lower and the upper bounds.

Future work should try to bridge the two sets of bounds. In our work, it is clear that there is
room for improving and tightening both lower and upper bounds. The minimax lower bounds
derived in Section 5 are not tight because we decreased the grid size dividing by $k!$, which eliminated
many permutations of $\psi(\theta)$, more than once each. In Section 6, the assumption that complete
spheres are contained in the pattern space, and the division by their complete volume to find the
number of spheres packed in the space resulted in a possibly loose bound for most sources. It may
be possible to use techniques from combinatorics to tighten the bounds. The question is whether
such techniques will improve the first order asymptotics or not.

On the other hand, the quantization approach of the upper bound in the first region may be
useful also for the second region with larger $k$'s. The derivation of the upper bound of the second
region does not quantize the estimators of the probability parameters, possibly leading to a loose
bound. One can show that quantization of the ML parameters into the vector $\hat{\varphi}$ whose components
are on the grid points of $\tau$ defined in (68) can result in representation cost of $O(n^{(1+\varepsilon)/3})$ even for
large $k$'s. More precisely, let $\beta$ be some partitioning index in the grid $\tau$. Consider representing the
quantized ML pattern probability parameters of $\hat{\varphi}$ as follows: For each of the first $\beta$ grid points
in $\tau$ use up to $(1 + \varepsilon)\log k$ bits to represent how many letters have probabilities quantized in $\hat{\varphi}$ to
this grid point. In the remaining grid points (bounded by $B = \sqrt{n}^{1+\varepsilon}$) there are at most $n^{1+\varepsilon}/\beta^2$
quantized probability parameters in the components of $\hat{\varphi}$ (see, e.g., (68)-(69)). For each of these
components of $\hat{\varphi}$, one can use up to $(1 + \varepsilon)\log B$ bits to represent the index in $\tau$ of the point that
equals this component. This results in a total representation cost upper bounded by

$$(1 + \delta)\,\beta\log k + (1 + \delta)\,\frac{n^{1+\varepsilon}}{2\beta^2}\log n \qquad (94)$$

for some fixed $\delta > \varepsilon$. Differentiating w.r.t. $\beta$ to find the value of the partition point $\beta$ of $\tau$ that
yields the minimum of the expression above yields a bound on the representation cost of

$$(1 + \delta)\,1.5 \cdot n^{\frac{1}{3}(1+\varepsilon)} \cdot (\log n)^{1/3} \cdot (\log k)^{2/3} \qquad (95)$$

bits for coding $\hat{\varphi}$. A more complex analysis takes the second cost above as the number of choices
of at most $n^{1+\varepsilon}/\beta^2$ elements out of at most $\sqrt{n}^{1+\varepsilon}$ with repetitions allowed. Asymptotically, it
yields the bound of (95) divided by a factor of $3^{1/3}$. Considering the bound only for $k > n^{1/3}$, this
stronger bound can be bounded by $(1 + \delta)\,1.5 \cdot n^{\frac{1}{3}(1+\varepsilon)} \cdot \log k$.
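The step from (94) to (95) can be verified numerically. The sketch below assumes that, apart from the common factor $(1+\delta)$, the cost in (94) is $\beta\log k + \frac{n^{1+\varepsilon}}{2\beta^2}\log n$, which is what $(1+\varepsilon)\log B$ bits per component with $B = \sqrt{n}^{1+\varepsilon}$ suggests and what reproduces the factor 1.5 in (95). All function names are hypothetical.

```python
import math


def representation_cost(beta, n, k, eps):
    """Assumed form of (94) without the common (1 + delta) factor."""
    return beta * math.log2(k) + n ** (1 + eps) / (2 * beta**2) * math.log2(n)


def optimal_beta(n, k, eps):
    """Stationary point: d/d(beta) = log k - n^(1+eps) log n / beta^3 = 0."""
    return (n ** (1 + eps) * math.log2(n) / math.log2(k)) ** (1 / 3)


def bound_95(n, k, eps):
    """Form of (95) without the common (1 + delta) factor."""
    return 1.5 * n ** ((1 + eps) / 3) * math.log2(n) ** (1 / 3) * math.log2(k) ** (2 / 3)


n, k, eps = 10**6, 1000, 0.1
beta_star = optimal_beta(n, k, eps)
cost_at_optimum = representation_cost(beta_star, n, k, eps)
```

At the stationary point the first term contributes $(n^{1+\varepsilon}\log n)^{1/3}(\log k)^{2/3}$ and the second term half of that, which gives the factor 1.5.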

From (95) we know that the representation of a quantized version of the pattern ML estimator
whose components are quantized as proposed in Section 7 can cost $O(n^{(1+\varepsilon)/3})$ bits for every $k$,
including all large values of $k$ up to $k = n$. However, the quantization cost, in this case, increases
and becomes of $O(k/n^{1+\varepsilon})$ bits per symbol. While subsequent work in [35] has already improved
the upper bound by trading off between the two costs, a gap to the lower bound still remains.
Therefore, further research should explore this direction in order to further reduce the
upper bound that consists of these two components.

10 Summary and Conclusions 

We studied the average universal coding problem of patterns of sequences generated by i.i.d. sources.
Lower bounds on the average minimax redundancy and the redundancy for most sources were obtained,
as well as upper bounds obtained for specific codes. It was shown that for essentially small
alphabet sizes, the redundancy cost in coding patterns is between $0.5\log(n/k^3)$ and $0.5\log(n/k^2)$
bits per each unknown probability parameter in all average senses. For essentially large alphabets,
this cost is between $O(n^{1/3})$ and $O(n^{1/2})$ bits overall. These redundancies are better than those
attained in standard i.i.d. sequence compression. In particular, for large $k$'s where universal compression
with vanishing redundancy is impossible in the i.i.d. case, here it has vanishing redundancy.
The gain over i.i.d. compression increases with $k$, since for large $k$'s, a fixed cost is maintained,
regardless of the value of $k$. This gain is reflected even more in the existence of universal pattern
codes whose pattern universal description length is smaller than the i.i.d. non-universal MDL of the
underlying sequence if the alphabet is large enough. This implies a decrease of the pattern entropy
w.r.t. the underlying i.i.d. one. This overall gain, of course, does not come for free, and the cost
is embedded in coding the unknown alphabet characters before the patterns are obtained. Two
low-complexity sub-optimal sequential algorithms were presented and were used to demonstrate the
gain in coding patterns over the i.i.d. case. Future work should attempt to bridge the gap between
the upper and lower bounds. Also, the results on the i.i.d. problem can serve as a basis for further
research on pattern compression for patterns induced by non-memoryless sources.



Appendix A — Proof of Lemma 5.1 

The number of vectors $b$ with integer components that satisfy (36) is lower bounded by the number
of cubes with edge 1 that are fully contained in the positive quadrant of the $(k-1)$-dimensional
sphere centered at the origin with radius $\sqrt{n^{1-\varepsilon}}$. (This is a lower bound, because there are, in fact,
more vectors with zero components than this number.) The number of these cubes equals the total
volume of these cubes. However, since there can exist cubes that are only partially contained in the
sphere, the volume of the sphere with radius $\sqrt{n^{1-\varepsilon}}$ cannot be used to bound the total volume of
these cubes. Instead, we can subtract the longest diagonal of these cubes $\sqrt{k-1}$ from the radius
of the original sphere, and use the new sphere with radius $\sqrt{n^{1-\varepsilon}} - \sqrt{k-1}$ to bound the total



volume of these cubes, or even use a shorter radius \fn 6 < \/n l £ — \Jk — 1 for this bound (this 
radius is shorter by the assumption that k < n 1 " 26 <C n 1-£ ). The volume of the positive quadrant 
of this sphere is bounded in (37). It is thus only left to show that all cubes that are only partially
contained (or are not contained) in the sphere with radius $\sqrt{n^{1-\varepsilon}}$ are completely outside the sphere
with radius $\sqrt{n^{1-\varepsilon}} - \sqrt{k-1}$. Hence, the volume of this sphere is a lower bound on the total volume
of cubes for which the farthest points from the origin satisfy (36), and thus a lower bound on the
number of integer component vectors $b$ that satisfy this condition.
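The counting argument can be illustrated in low dimension. The sketch below assumes that condition (36), which is not reproduced here, has the form $\sum_i b_i^2 \le r^2$ for a radius $r$. It counts integer vectors with all components at least 1 (each is the far corner of a unit cube lying inside the sphere) and compares the count with the positive-quadrant volume of the sphere whose radius is reduced by the cube diagonal. The names `dim` and `radius` stand in for $k-1$ and $\sqrt{n^{1-\varepsilon}}$ and are illustrative only.

```python
import math
from itertools import product


def count_positive_integer_points(dim, radius):
    """Count integer vectors with all components >= 1 and Euclidean norm <= radius."""
    r = int(math.floor(radius))
    r2 = radius * radius
    return sum(
        1
        for b in product(range(1, r + 1), repeat=dim)
        if sum(x * x for x in b) <= r2
    )


def ball_volume(dim, radius):
    """Volume of a dim-dimensional Euclidean ball."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius**dim


def reduced_orthant_volume(dim, radius):
    """Positive-quadrant volume of the ball with radius reduced by sqrt(dim)."""
    reduced = max(radius - math.sqrt(dim), 0.0)
    return ball_volume(dim, reduced) / 2**dim


count = count_positive_integer_points(3, 20.0)
lower = reduced_orthant_volume(3, 20.0)
upper = ball_volume(3, 20.0) / 2**3
```

In this small example the count lies between the reduced-radius volume and the full positive-quadrant volume, as the argument predicts.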

Let $a = (a_1, a_2, \ldots, a_{k-1})$ be the farthest point from the origin in an edge 1 cube that is either
partially in the sphere with radius $\sqrt{n^{1-\varepsilon}}$ centered at the origin or not in the sphere. By definition,

$$\sum_{i=1}^{k-1} a_i^2 > n^{1-\varepsilon}. \qquad (A.1)$$



By Jensen's inequality on the function x 2 , 

fc-i /fc-i \ 2 , /fc-i \ ' fc-i 



fc-i /fc-i \ L 1 /fc-i \ z fc-i fc-i 



(A.2) 



The nearest point to the origin of the cube considered is the point $(a_1 - 1, a_2 - 1, \ldots, a_{k-1} - 1)$.
To prove that the cube is completely outside the new sphere, we need to show that this nearest
point to the origin is outside this sphere, i.e., that $\sum_{i=1}^{k-1} (a_i - 1)^2 > \rho^2$ where $\rho = \sqrt{n^{1-\varepsilon}} - \sqrt{k-1}$ is
the radius of the new sphere. The difference between the two sides of this inequality is



fc-i 



£ {a t - If - (W 



i=i 



l-e 




'k-l 



2 a i - VA; - ly/n 



l-e 



1 

l-e 



> 0. 



(A.3) 



The first inequality is obtained by using (A.1) for the first term and (A.2) for the second. The
next inequality is because the left term is positive as long as $k < n^{1-\varepsilon}/4 + 1$, and the right term is
positive by (A.1). This proves that any point in a cube that is not completely inside the sphere with
radius $\sqrt{n^{1-\varepsilon}}$ must be outside the sphere we defined with a smaller radius, and thus the volume of
this sphere in the positive quadrant lower bounds the number of nonnegative integer component
vectors that satisfy (36). This concludes the proof of Lemma 5.1. □



Appendix B — Proof of Lemma 5.2 

Let the observed data sequence $X^n$ be generated with distribution $P_\theta(x^n)$, where $\theta \in \Omega$. Let $\hat{\theta}$ be
the ML estimate of $\theta$ from $X^n$. To bound the probability of event $A$, we will use the union bound
on events $A_i$. Define

$$\delta_i = \hat{\theta}_i - \theta_i. \qquad (B.1)$$

As defined in (41), event $A_i$ for $1 \le i \le k$ occurs if $|\delta_i| > \Delta(\tau_{b_i})/2$, where $\Delta(\tau_{b_i})$ is as defined in
(33). Recall that for $i < k$, $\tau_{b_i} = \theta_i$. For $i = k$, the constrained parameter $\theta_k$ may not be on $\tau$
and we define $\tau_{b_k}$ as the nearest point in $\tau$ to $\theta_k$ that is smaller than or equal to $\theta_k$. Note that in
order to generate a bound that can be useful for the distinguishability of patterns, we must bound 
Pg(A), which is greater than Pg / 6^j. The latter is sufficient for distinguishability in the i.i.d. 
case (see, e.g., [30]). In particular, we must include $A_k$ in the error event, although we can use the
assumption that $\theta_k \ge \theta_i$, for all $i$, $1 \le i \le k - 1$. Hence, for all $i$, $\theta_i \ge 1/n^{1-\varepsilon}$ from the definition
of the minimum grid point in (31).

In the following lemma, we lower bound $|\delta_i|$ as a function of $\theta_i$ given event $A_i$ occurred. Following
the lemma and its proof, we use this bound and the union bound on the components of $\theta$ to show
that the overall probability of $A$ vanishes.
Lemma B.1 If event $A_i$ occurs, then event

$$B_i : |\delta_i| \ge \frac{\sqrt{\theta_i}}{10\sqrt{n^{1-\varepsilon}}} \qquad (B.2)$$

must occur as well.

Note that if $\theta_i = 0$, Lemma B.1 holds simply because the right hand side of (B.2) is zero.

Proof: First, extend the definition of the function $\Delta(\theta)$, defined in (33), to every value of $\theta \ge 1/n^{1-\varepsilon}$,



A (9) 



A 



n 



l-e 



> r l-e- 



(B.3) 



The function A (9) is increasing in 9. Now, given Ai occurred, consider two separate cases: (1) 
0i < r bl +2, and (2) 0, > r bi+2 . For case (1), 



2 " 10 " W^i 1 - 6 ~ lO^ 1 
To obtain the second inequality, we use the fact that A (T^+2) < 5A (r^), which can be shown by 
observing the case 6j = 1. Then, inequality (B.3) is used. Finally, the definition of this case leads 
to the last inequality. (Note that we need to consider u i+ 2 instead of t^+i only in case 6k is closer 
to Tb k+ \ than to Tb k , resulting in 9^ > n k+ i still satisfying the complement event to A^.) For case 
(2), let bi be the index of the largest grid point still smaller than §i, i.e., 9{ > rg.. Then, since there 
is more than one unit of grid spacing between B\ and 9i, 



-l-e • 



(B.4) 



> A 



V b 'J r0-~ e ~ n 1 " 2 2~2 



> 



2v^ 



l-e ' 



(B.5) 



□ 



The second inequality is obtained since b\ > 2. This concludes the proof of Lemma B.l. 

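Using Lemma B.1 and the union bound,

$$P_\theta(A) \le \sum_{i=1}^{k} P_\theta(A_i) \le \sum_{i=1}^{k} P_\theta(B_i), \qquad (B.6)$$

and we need to bound $P_\theta(B_i)$. Consider the Bernoulli $n$-sequence $Y_i$, whose $j$th symbol is defined
by

$$Y_{ij} = \begin{cases} 1, & \text{if } X_j = i, \\ 0, & \text{otherwise}, \end{cases} \qquad (B.7)$$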



where $X_j$ is the $j$th symbol of $X^n$, and we assume, without loss of generality, that the $k$ alphabet
letters are $1, 2, \ldots, k$. Let $P_{\theta_i}(Y_i = y^n)$ be the probability that $Y_i$ takes value $y^n = (y_1, y_2, \ldots, y_n)$,
where the symbols $y_j$ can be either 0 or 1. Let $P_{\hat{\theta}_i}$ be the empirical distribution of $y^n$ which was
drawn by $P_{\theta_i}(Y_i)$, i.e., the Bernoulli probability mass function induced by the ML estimator $\hat{\theta}_i$
of $\theta_i$ on the random vector $Y_i$. For a given value of the random sequence $X^n$, the sequence $Y_i$,
defined in (B.7), will give the exact same ML estimator $\hat{\theta}_i$ as the one obtained from $X^n$. Therefore,
by typical sequences analysis (see [3], [4]),

$$P_\theta(B_i) = P_{\theta_i}(B_i) \le n \cdot 2^{-n \min_{\hat{\theta}_i \in B_i} D(P_{\hat{\theta}_i} \| P_{\theta_i})}, \qquad (B.8)$$

where $D(P_{\hat{\theta}_i} \| P_{\theta_i})$ is the divergence (relative entropy) between the two distributions, and the
coefficient $n$ bounds the number of possible different $n$-sequence types, for which event $B_i$ occurs.
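The type-counting bound behind (B.8) can be checked on a toy Bernoulli example. The sketch below is illustrative only: it uses a generic deviation event $|\hat{\theta} - \theta| \ge t$ rather than the specific event $B_i$, and it counts the types in the event exactly instead of bounding their number by $n$. The function names are hypothetical and `0 < theta < 1` is assumed.

```python
import math


def binary_kl_bits(p_hat, p):
    """D(Bernoulli(p_hat) || Bernoulli(p)) in bits, with 0 log 0 = 0."""
    d = 0.0
    if p_hat > 0:
        d += p_hat * math.log2(p_hat / p)
    if p_hat < 1:
        d += (1 - p_hat) * math.log2((1 - p_hat) / (1 - p))
    return d


def deviation_probability_and_bound(n, theta, t):
    """Exact P(|theta_hat - theta| >= t) and (#types in event) * 2^(-n * min divergence)."""
    exact = 0.0
    min_div = math.inf
    n_types = 0
    for m in range(n + 1):  # m = number of ones, so the type is theta_hat = m / n
        theta_hat = m / n
        if abs(theta_hat - theta) >= t:
            exact += math.comb(n, m) * theta**m * (1 - theta) ** (n - m)
            min_div = min(min_div, binary_kl_bits(theta_hat, theta))
            n_types += 1
    bound = n_types * 2.0 ** (-n * min_div)
    return exact, bound, n_types


exact, bound, n_types = deviation_probability_and_bound(1000, 0.3, 0.1)
```

Each type class has probability at most $2^{-nD}$, so the exact deviation probability never exceeds the number of types in the event times the largest such term.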

We now need to lower bound $D(P_{\hat{\theta}_i} \| P_{\theta_i})$ given event $B_i$ has occurred. This is done as follows:
First, let us define the function



A \\ 0<x<l, 
= < 

(l-ln2)x; x > 1. 
Using Taylor series expansions, it can be shown that 



(B.9) 



log(l + x) > [— x + f(x)] loge, if x > 



-log(l-x) > (x+yjloge, if < x < 1. 



(B.10) 
(B.ll) 



We will use these inequalities in the following derivations. Given P>i has occurred, for Oi > 0, 

1 - Oi 



D(P § . II P e ) = dilogf + (l-^)log 



1 



> 



0i + 5i 



= 9 i \og-J^-+(\-oMog-±- 

Vi — Oi v 7 1 — 



> < 



log e 
loge 

log e 
loge 



28? 



+ f 



; if Si > 0, 



20, 



if Si > 
otherwise 



loge . 
200n 1 - £ ' 
log e . 

( '°a^ 2) ; if*<0andf >1. 



if <5j < and < l M < 1, 



if <5j > 
otherwise 



(B.12) 
The first inequality is obtained by applying (B.10)-(B.11). (Note that it is true also in the limits of
$\theta_i \to 0$ and $\theta_i \to 1$.) Then, the first order terms cancel each other out. Finally, since all remaining
terms are positive, we only use the first term in each case with the definition of event $B_i$ in (B.2)
to obtain the last inequality. In the third case, we also assume that a nonzero ML estimator must
satisfy $\hat{\theta}_i \ge 1/n$ to obtain the bound. For $\hat{\theta}_i = 0$,

$$D(P_{\hat{\theta}_i} \| P_{\theta_i}) = -\log(1 - \theta_i) \ge \theta_i \log e \ge \frac{\log e}{n^{1-\varepsilon}}, \qquad (B.13)$$

where the first inequality is obtained since $-\ln(1 - x) \ge x$ for $0 \le x < 1$, and the second by the
definition of the minimum grid point in (31).

We can now plug the lower bounds on D (^P^ \\ Pg.^J in (B.8) to bound Pg (Bi) 

Pe (Bi) < n ■ 2~ n '^^ = 2 lo s™- c ™ £/2 , (B.14) 

where c is a constant that is the minimum over all the cases described above. Finally, by the union 
bound in (B.6), we obtain 

P e (A)<k- max {P e (Bi)} < 2 ( lo g fc )+(i°gn)-cn-/ 2 ^ o. (B.15) 

i 

This concludes the proof of Lemma 5.2. □ 



Appendix C 



Proof of Lemma 6.1 



Let $X^n$ be the observed random data sequence, which was generated by point $\theta$ on the uniform
random grid. Let $\hat{\theta}$ be the ML estimator of $\theta$ from $X^n$. Let $\delta_i$ be defined as in (B.1). Then, for
the event in (62), we have

k-i j 



e-e 



> 



in 



-l-e • 



(C.l) 



As in Appendix B, we will show that the event in (C.l) is a union of events, and use the union bound 
on these events to bound the error probability. However, here, the events are more complicated. 
We start with a lemma, that will be used to define the events. 



Lemma C.1 Let $n \to \infty$ and let $\hat{\theta}$ and $\theta$ satisfy (C.1). Then, there exists $j$; $1 \le j \le k' = \min\{2n^{1-\varepsilon/2}, k - 1\}$;
such that for at least $j$ components $\theta_i$ of $\theta$,



(9i-ei) 2 >—^ 

\ J jn L ~ 



e/2- 



(C.2) 
Proof: Let $\mathcal{F}$ contain all the indices $i$ for which either $\theta_i > 1/n^{1-\varepsilon/2}$ or $\hat{\theta}_i > 1/n^{1-\varepsilon/2}$. The
cardinality of $\mathcal{F}$ is bounded by $|\mathcal{F}| \le 2n^{1-\varepsilon/2}$. We separate the components of $\theta$ in $\mathcal{F}$ from those
outside of it. For $i \notin \mathcal{F}$, let $\theta_i = \alpha_i/n^{1-\varepsilon/2}$ and $\hat{\theta}_i = \hat{\alpha}_i/n^{1-\varepsilon/2}$, where $0 \le \alpha_i \le 1$ and $0 \le \hat{\alpha}_i \le 1$.
The contribution of all $i \notin \mathcal{F}$ to the sum in (C.1) is negligible, and thus the event defined in (C.1)
depends mostly on $i \in \mathcal{F}$. This step is necessary to show that distinguishability includes also
sources with more than $k_m$ letters for larger $k$.

Now, assume that no j as defined above exists. Then, for every i, 



(Oi-Oi) <- riJ - 2 . (C.3) 

Then, for every component i, but one, 



and for at most one component 9i of 0, 



2n 1 - £ / 2 v / n 
Next, there are at most two components 0j of 0, for which 

but for at least one of them, (C.4) must also be satisfied, and for both (C.3) must be satisfied. We 
can proceed this process up to j = min {k — 1, 2n 1 ~ £ / 2 } < 2n 1 ~ £ / 2 . Using this process, (C.3)-(C.6), 
and the following similar equations that can be obtained for larger j's, we can obtain the upper 
bound 



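$$\begin{aligned}
\|\hat\theta-\theta\|^2 &= \sum_{i\notin\mathcal{F}}(\hat\theta_i-\theta_i)^2 + \sum_{i\in\mathcal{F}}(\hat\theta_i-\theta_i)^2 \le \sum_{i\notin\mathcal{F}}\hat\theta_i^2 + \sum_{i\notin\mathcal{F}}\theta_i^2 + \sum_{j=1}^{2n^{1-\varepsilon/2}}\frac{1}{j\,n^{1-\varepsilon/2}} \\
&\le \sum_{i\notin\mathcal{F}}\frac{\hat a_i^2}{n^{2-\varepsilon}} + \sum_{i\notin\mathcal{F}}\frac{a_i^2}{n^{2-\varepsilon}} + \frac{1}{n^{1-\varepsilon/2}}\left(1+\int_1^{2n^{1-\varepsilon/2}}\frac{dx}{x}\right) \\
&\le \sum_{i\notin\mathcal{F}}\frac{\hat a_i}{n^{2-\varepsilon}} + \sum_{i\notin\mathcal{F}}\frac{a_i}{n^{2-\varepsilon}} + \frac{1}{n^{1-\varepsilon/2}}\left[\ln\left(2n^{1-\varepsilon/2}\right)+1\right] \\
&\le \frac{2}{n^{1-\varepsilon/2}} + \frac{n^{\varepsilon/4}}{n^{1-\varepsilon/2}} < \frac{1}{n^{1-\varepsilon}}.
\end{aligned} \tag{C.7}$$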



The first inequality is by applying the above relations and by bounding the square distance for small probabilities by the sum of their squares. The third inequality is since $a_i^2 \le a_i$ since $a_i \le 1$ for $i \notin \mathcal{F}$, and the same applies for $\hat a_i$. The next inequality is since $\sum_{i\notin\mathcal{F}} a_i / n^{1-\varepsilon/2} = \sum_{i\notin\mathcal{F}}\theta_i \le 1$, and again, the same is applied for $\hat a_i$. In addition, we apply $n \to \infty$ to bound the last term for this inequality and for the last inequality. Inequality (C.7) contradicts (C.1). This concludes the proof of Lemma C.1. □

Before we use Lemma C.1, let us partition $\theta$ into the set $\theta^-$ containing all letters with $\theta_i \le 1/n^{2+\varepsilon}$ and $\theta^+$, containing all the remaining letters. This is, again, necessary for the case in which $k > k_m$. Let event $T$ contain all $x^n$ for which any of the letters in $\theta^-$ occurs more than once. We can now use (C.2) to define event $A_j$ as all sequences $x^n$ for which there are (at least) $j$ components $\hat\theta_i$ of $\theta^+$, for which (C.2) is satisfied. Thus,

$$P_\theta\left(\|\hat\theta-\theta\| > \frac{1}{\sqrt{n^{1-\varepsilon}}}\right) \le P_\theta(T) + \sum_{j=1}^{k''} P_\theta(A_j), \tag{C.8}$$

where $k'' = \min\{|\theta^+|, k'\} \le 2n^{1-\varepsilon/2}$. Inequality (C.8) is because if (C.1) is satisfied, either $A_j$ occurs for some $j$, or there exist components in $\theta^-$ for which (C.2) is satisfied. For such components, the occurrence of (C.2) means that the letter occurred (significantly) more than once.

First, let us bound the first term of (C.8). The probability that letter $i \in \theta^-$ occurs in $X^n$ is given by

$$P_\theta(i \in X^n) = 1-(1-\theta_i)^n \ge n\theta_i - \binom{n}{2}\theta_i^2. \tag{C.9}$$

The average number of re-occurrences (beyond the first occurrence) of such a letter is then upper bounded by

$$E N_x(i) - P_\theta(i \in X^n) \le \binom{n}{2}\theta_i^2, \tag{C.10}$$

where $E N_x(i)$ is the expected number of occurrences of letter $i$. Then, the average re-occurrence of any of the letters in $\theta^-$ is bounded by

$$\sum_{i\in\theta^-}\left\{E N_x(i) - P_\theta(i \in X^n)\right\} \le \binom{n}{2}\sum_{i\in\theta^-}\theta_i^2 \le \frac{n^2}{2}\sum_{i\in\theta^-}\frac{a_i^2}{n^{4+2\varepsilon}} \le \frac{1}{n^{\varepsilon}}, \tag{C.11}$$

where $\theta_i = a_i/n^{2+\varepsilon}$, $a_i \le 1$, and the last inequality is obtained similarly to the derivation in (C.7), where $\sum_i a_i^2 \le \sum_i a_i \le n^{2+\varepsilon}$. Using Markov's inequality, the probability of $T$ is bounded by the bound above, i.e., $P_\theta(T) \to 0$.
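The inequalities in (C.9) and (C.10) can be checked exactly for small cases. The following is a minimal sketch in Python using exact rational arithmetic; the function names are hypothetical and chosen only for illustration, and the test values of $n$ and $\theta_i$ are arbitrary.

```python
from fractions import Fraction
from math import comb


def occurrence_probability(theta, n):
    """Exact probability that a letter with probability theta appears in n i.i.d. draws."""
    return 1 - (1 - theta) ** n


def occurrence_lower_bound(theta, n):
    """Right-hand side of (C.9): n*theta - C(n,2)*theta^2."""
    return n * theta - comb(n, 2) * theta ** 2


def expected_reoccurrences(theta, n):
    """Left-hand side of (C.10): expected count minus probability of occurring at all."""
    return n * theta - occurrence_probability(theta, n)


def bounds_hold(theta, n):
    """True if both (C.9) and (C.10) hold for this (theta, n) pair."""
    c9 = occurrence_probability(theta, n) >= occurrence_lower_bound(theta, n)
    c10 = expected_reoccurrences(theta, n) <= comb(n, 2) * theta ** 2
    return c9 and c10


# Arbitrary illustrative grid of sequence lengths and letter probabilities.
results = [
    bounds_hold(Fraction(1, d), n)
    for n in (1, 2, 5, 20, 50)
    for d in (2, 10, 1000, 10 ** 6)
]
```

Exact fractions are used because, for very small $\theta_i$, the gap between the two sides of (C.9) is of order $n^3\theta_i^3$, which floating-point arithmetic cannot resolve reliably.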

Event $A_j$ in (C.8) is the union of all events for which any $j$ components of $\theta^+$ satisfy (C.2). This applies to any choice of $j$ components out of $k = |\theta^+| \le n^{2+\varepsilon}$. Let $A_{jl}$ be the event in which the $j$ components of the $l$th choice out of

$$L \le \binom{k}{j} \tag{C.12}$$

choices of components of $\theta^+$ satisfy (C.2). Using the union bound, again,

$$P_\theta(A_j) \le \sum_{l=1}^{L} P_\theta(A_{jl}) \le \binom{k}{j}\max_l P_\theta(A_{jl}) \le k^j\cdot\max_l P_\theta(A_{jl}). \tag{C.13}$$



To bound $P_\theta(A_{jl})$ for given $j$ and $l$, let us define a transformation of the original alphabet to the alphabet $\Sigma_{jl}$ with cardinality $j+1$. The $j$ letters denoted by $u_1, u_2, \ldots, u_j$, whose ML estimates satisfy (C.2), will be numbered from 1 to $j$, and all other letters will be transformed into the letter $j+1 \in \Sigma_{jl}$. Let $Y_l$ be an $n$-dimensional transformation of $X^n$ that takes a value $y^n$, such that, in a similar manner to (B.7),

$$Y_{lm} = \begin{cases} i, & \text{if } X_m = u_i, \\ j+1, & \text{otherwise.} \end{cases} \tag{C.14}$$

Let $\varphi_i = \theta_{u_i}$ for $1 \le i \le j$ be the probability of $Y_{lm}$ taking the value $i$, where $\varphi_{j+1}$ is the sum of all the remaining probability components of $\theta$. Let $\hat\varphi_i$ be the ML estimate of $\varphi_i$ from the vector $Y_l$. Let $\varphi$ be the $j$ dimensional vector that defines the i.i.d. distribution of vector $Y_l$. Since the probability of $A_{jl}$ depends only on the $j$ parameters that satisfy (C.2), we can now use the new parameter vector $\varphi$, which is a permutation of these $j$ parameters with all other components of $\theta$ condensed into one probability parameter, to bound this probability. By typical sets analysis,

$$P_\theta(A_{jl}) = P_\varphi(A_{jl}) \le (n+1)^j\, 2^{-n \min_{y^n \in A_{jl}} D(P_{\hat\varphi}\,\|\,P_\varphi)}, \tag{C.15}$$



where the polynomial coefficient is a bound on the number of types. To bound the expression in (C.15), we can lower bound the divergence in its exponent. Let $U_1$ be the set of components of $\varphi$ for which $\varphi_i \ge \hat\varphi_i$, and $U_2$ the set for which $\varphi_i < \hat\varphi_i$. Also define $\delta_i$ now w.r.t. $\hat\varphi$ and $\varphi$. Then,



D(P,p\\P v ) = E ^ lo S-+ E ^ lo S 



(Pi 



E*.>o g (i-|)-E/.i° g (i-| 



> log e 



log e • < 



E & 



Si 

— + 

_(pi 2(p: 



S} 1 



%2 



+ E & 



<Pi V fi 



well! Lf>% <p^u 2 V Lp% 



(C.16) 



The inequality is obtained from (B.10) and (B.11) and the definition of the function $f(\cdot)$ in (B.9). The last equality is since all the first order terms cancel each other. Now, define the set $U_1'$ as the union of all components $\varphi_i$, $1 \le i \le j$, in $U_1$ and those components in $U_2$ for which $|\delta_i| < \varphi_i$, and $U_2'$ as the set of all the remaining components in $U_2$. (Note that we extract $\varphi_{j+1}$ from both sets.) Assume that there are $\alpha j$, $0 \le \alpha \le 1$, components in $U_2'$. Then, by definition of $f(\cdot)$ in (B.9), and since all $j$ components in both sets satisfy (C.2), we obtain from (C.16),

loge ^ l°g( e /2) 



D{P t \\P r ) > 



> 



> 



(loge) (1 - a) j 



l-e/2 



4j n l-e/2 



En 



1 



<f>i&J[ 



, (1 - a)j & 



1 + o_v(Jlog( e /2) 



(loge)(l-a) 2 j 2 | 4aj 2 log(e/2) > 



n 



l-e/2 



(C.17) 



The first term of the third inequality is obtained by Jensen's inequality over the convex function $1/x$, and since the sum on all $\varphi_i \in U_1'$ is not larger than 1. The second term is obtained since $j \le 2n^{1-\varepsilon/2}$. The last inequality is obtained since the expression is greater than 0 for every value of $\alpha$, and we can choose a proper constant $c > 0$, for which the inequality is satisfied. (We note that the derivation above also applies in the limit if there exist components whose ML estimates are 0.) Combining (C.8) and the bound on $P_\theta(T)$, (C.13), (C.15), and (C.17), we conclude that

$$\begin{aligned}
P_\theta\left(\|\hat\theta-\theta\| > \frac{1}{\sqrt{n^{1-\varepsilon}}}\right) &\le \frac{1}{n^{\varepsilon}} + \sum_{j=1}^{k''} 2^{-j\left[cn^{\varepsilon/2} - \log(n+1) - \log k\right]} \\
&\le \frac{1}{n^{\varepsilon}} + 2^{-\left[cn^{\varepsilon/2} - \log(n+1) - (2+\varepsilon)\log n - \log\left(2n^{1-\varepsilon/2}\right)\right]} \to 0.
\end{aligned} \tag{C.18}$$

This concludes the proof of Lemma 6.1. □
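The typical-set step in (C.15) rests on two standard facts from the method of types: the number of types over an alphabet of $j+1$ letters is at most $(n+1)^j$, and the probability of a type class is at most $2^{-nD}$, with $D$ the divergence between the type and the source. The following Python sketch checks both by exhaustive enumeration for a small case. The function names and the example distribution are assumptions made only for this illustration.

```python
from math import factorial, log2


def types(n, m):
    """Yield every count vector of length m whose entries sum to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in types(n - first, m - 1):
            yield (first,) + rest


def divergence_bits(p_hat, p):
    """D(p_hat || p) in bits, with the convention 0 log 0 = 0."""
    return sum(a * log2(a / b) for a, b in zip(p_hat, p) if a > 0)


def type_class_probability(counts, p):
    """Probability, under i.i.d. p, of drawing any sequence with these counts."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)  # multinomial coefficient, built up exactly
    prob = float(coef)
    for c, q in zip(counts, p):
        prob *= q ** c
    return prob


def check_type_bounds(n, p):
    """Return (number of types, True if every type class obeys the 2^{-nD} bound)."""
    all_types = list(types(n, len(p)))
    ok = all(
        type_class_probability(c, p)
        <= 2 ** (-n * divergence_bits([x / n for x in c], p)) * (1 + 1e-9)
        for c in all_types
    )
    return len(all_types), ok


# Illustrative case: j = 2 tracked letters plus one merged letter, n = 6.
num_types, bound_ok = check_type_bounds(6, [0.5, 0.3, 0.2])
```

Here the alphabet has $j+1 = 3$ letters, so the count of types should not exceed $(n+1)^j = 49$. The small multiplicative tolerance only absorbs floating-point rounding in the cases where the bound holds with equality.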



Appendix D — Proof of Lemma 7.1 



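Let us first bound the logarithm of the ratio between the probability given by the parameter vector $\varphi$ and the ML probability of $X^n$. Similarly to (B.10)-(B.11), if $x < 1$, then

$$\log(1-x) \le (\log e)\cdot\left[-x - f'(x)\right], \tag{D.1}$$

where

$$f'(x) = \begin{cases} \dfrac{x^2}{2}; & \text{if } x \ge 0, \\[1ex] \dfrac{x^2}{4}; & \text{if } 0 > x > -1, \\[1ex] (1-\ln 2)|x|; & \text{if } x \le -1. \end{cases} \tag{D.2}$$

Using the above,

$$\log\frac{P_\varphi(X^n)}{P_{\hat\theta}(X^n)} = \log\prod_{i=1}^{k}\left(1-\frac{\delta_i}{\hat\theta_i}\right)^{n\hat\theta_i} \tag{D.3}$$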
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< (log e)n^2§i 



i=i 



(loge)n^/' 



(log e)n^2§i 



■ < 



i=i 



it 

it 
40?' 



< 



< 



< 



(log e)n^2§i 

i=i 

(log e) n ^ 

ieJ 



(log e) n ^ 



4e 2 



(l-ln2)| 
(l-ln2)|i 



4 J ' r jl-e/4 ) 

(l-ln2U 



fc 



i=i 

if | ^ °' 
if > |i > -1 

Pi 

if 4>i < 2§i, 
if 4>i > 28i 

if & < 2§i, 
^jpz- if -7, > 20, 



ieJ 



4jn 1 ~ £ / 8 



< 



4(ln2)' 



(D.4) 



(D.5) 



(D.6) 



(D.7) 



(D.8) 



The inequality in (D.4) is obtained from (D.1), and the equality since the summation on all $\delta_i$ must be zero. The boundaries in (D.6) are obtained from the definition of $\delta_i$ in (74). To obtain (D.7), we bound all (negative) elements of the sum for which $i \notin J$ by zero, and all elements $i \in J$ using (75). Then, to obtain (D.8), we take the maximum over the different regions, and also use the fact that by the definition of a $k$-dimensional i.i.d. ML vector, $\hat\theta_i \ge 1/n$. The last inequality follows from the fact that there are at least $j$ elements in $J$.



Taking the bound of (D.8), we obtain 



k\P^ (X n ) 



< k\ ■ exp < — 



kn e/8 



< exp 



n 



e/8 



-In A; 



0. 



(D.9) 



This concludes the proof of Lemma 7.1. □
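The bound (D.1) with the piecewise function $f'(\cdot)$ of (D.2) can be checked numerically. The sketch below works in nats, so that $\log e = 1$ and (D.1) reads $\ln(1-x) \le -x - f'(x)$. It assumes that the third branch of (D.2) is $(1-\ln 2)|x|$ for $x \le -1$; the function names and the sampling grid are choices made only for this illustration.

```python
import math


def f_prime(x):
    """Piecewise correction term of (D.2); third branch assumed to be (1 - ln 2)|x|."""
    if x >= 0:
        return x * x / 2
    if x > -1:
        return x * x / 4
    return (1 - math.log(2)) * abs(x)


def gap(x):
    """Right side minus left side of (D.1) in nats; non-negative when (D.1) holds."""
    return (-x - f_prime(x)) - math.log(1 - x)


# Sample x on a grid over roughly [-50, 0.989]; (D.1) requires x < 1.
grid = [-50 + 0.001 * t for t in range(50_990)]
worst_gap = min(gap(x) for x in grid)
```

The bound is tight at $x = 0$ and at $x = -1$ (where $\ln 2 = 1 - (1 - \ln 2)$), so the smallest gap on the grid is expected to be zero up to rounding.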



Appendix E — Proof of Lemma 7.2 

To prove Lemma 7.2, we express the logarithm of the desired ratio as a function of the components of $\varphi$ and of distances between components of $\psi(\sigma)$, $\varphi(\sigma)$, and $\hat\theta$. First, we bound distances between corresponding components of the three vectors under the assumption that $\psi(\sigma) \notin A$, and use these bounds to bound the logarithm of the ratio in (78). Let $\delta(a, b) = a - b$ be the difference between $a$ and $b$. Then, by definition of $\varphi$ as the quantized form of $\psi$, quantized onto points in $\tau$, and by definition of $\tau$, we must have for every $i$, $1 \le i \le k-1$,

\S {fa <pi)\ < A (t % . )+1 ) = < nl+e = ^=i+f> (E.l) 
where $\Delta(\cdot)$ is defined as in (33) but w.r.t. $\tau$ defined in (68). The first inequality is obtained since either $\psi_i \in [\tau_{b(\varphi_i)-1}, \tau_{b(\varphi_i)} = \varphi_i]$ or $\psi_i \in [\tau_{b(\varphi_i)}, \tau_{b(\varphi_i)+1}]$. In either case, $\psi_i$ is at most $\Delta(\tau_{b(\varphi_i)+1})$ away from $\varphi_i$. The last equality is obtained using a similar equation to (32) where $-\varepsilon$ is replaced by $\varepsilon$ for the proper grid. The distance between the last $k$th components of $\psi$ and $\varphi$ can be bounded similarly by

9 5. /Toil 7 9 Pi . /TnZ 

(E.2) 



|d (^fc,¥»fc)| < ^ < 



^l+e — ^1+e " 

This is because of the procedure used to quantize $\psi$ into $\varphi$, which ensures that the absolute value of the cumulative difference between the components of $\psi$ and those of $\varphi$ is minimized, and is therefore bounded by the maximal spacing around the largest free component.

From Lemma 7.1, in order for $P_{\psi(\sigma)}(X^n)$, the probability of $X^n$ that is given by a permutation $\psi(\sigma)$, not to be negligible w.r.t. the ML probability of $X^n$, $\psi(\sigma)$ must have, for every $j$, no more than $j-1$ components for which (75) is satisfied (where $\delta_i$ is replaced by $\delta[\hat\theta_i, \psi(\sigma_i)]$, and $\varphi_i$ by $\psi(\sigma_i)$). This implies that if a permutation $\psi(\sigma)$ of $\psi$ is not negligible, it must have at least $k-j+1$ components for every $j$, $1 \le j \le k$, that satisfy



< < 



3 y/n 

n 



l-e/4 I 



if V(^) > 20;, 
— - ifV(^)<2^. 



(E.3) 



Hence, in the worst case, there is one distance component for which the tightest upper bound is obtained from (E.3) with $j = 1$, one for $j = 2$, and so on, up to $j = k$, i.e., for each $j$, the inequality is satisfied for a distinct component $i$. Conversely, for the worst case, we can denote the distinct value of $j$ for each $i$ as a function of $i$ and of the two vectors $\hat\theta$ and $\psi(\sigma)$, i.e., as $j[i, \psi(\sigma), \hat\theta]$.

We can now express $\delta[\hat\theta_i, \varphi(\sigma_i)]$ as

$$\delta[\hat\theta_i, \varphi(\sigma_i)] = \hat\theta_i - \varphi(\sigma_i) = \hat\theta_i - \psi(\sigma_i) + \psi(\sigma_i) - \varphi(\sigma_i) = \delta[\hat\theta_i, \psi(\sigma_i)] + \delta[\psi(\sigma_i), \varphi(\sigma_i)]. \tag{E.4}$$



By the triangle inequality, (E.1), (E.2), and (E.3), if $\psi(\sigma) \notin A$, for the $k-j+1$ components of $\psi(\sigma)$ that satisfy (E.3),



0i,<p(<7i) 



< 



0i,Y>(<7j) +\S{tp((T i ),cp((T i 



< 2 • max ■ 



| 5 0i,ip((Ji) ,\S[ip(ai),ip(ai)]\j 



< 



; if 9i > 2<p(<n). 



-l-e/4 



(E.5) 
The first region is obtained by combining the worse bound of (E.3) with that of (E.1). The bound in the second region is true because $2\hat\theta_i > 4\varphi(\sigma_i) > \psi(\sigma_i)$, since it can be shown by definition of $\tau$ that always $\psi(\sigma_i) \le 3\varphi(\sigma_i)$.

The first region of the bound in (E.5) is expressed as a function of $\varphi(\sigma_i)$. However, the second region is in terms of $\hat\theta_i$. In order to obtain the bound of (78), we need to express both bounds in terms of $\varphi(\sigma_i)$. Hence, we need to first bound the second region of (E.5) in terms of $\varphi(\sigma_i)$, or alternatively bound $\sqrt{\hat\theta_i}$ in terms of $\sqrt{\varphi(\sigma_i)}$. To achieve that, we observe that $\varphi(\sigma_i)$ is smaller than half of $\hat\theta_i$. This means that we represent the i.i.d. ML probability component $\hat\theta_i$ using $\psi(\sigma)$ by a probability that is roughly smaller than its half (since $\varphi(\sigma_i)$ is asymptotically much closer to $\psi(\sigma_i)$). If $\hat\theta_i$ is large, this must yield a negligible probability because $\psi(\sigma_i)$ will be too far from $\hat\theta_i$. Therefore, there must be an upper bound on $\hat\theta_i$ for which the second region of (E.5) still applies while $\psi(\sigma) \notin A$. By the bound in the second region of (E.5), we must have

5V% 



> 



,<P(cTi 



\Jn V J 

Hence, by rearranging terms of the last inequality, 

WOk 



Oi - <p((Ti) > 



(E.6) 



Oi < 



jn 



l-e/4" 



(E.7) 



Now, we need the following lemma. 



Lemma E.1 Let $\hat k = k \le \sqrt{n^{1-\varepsilon}}$, and let $\xi > 0$ be arbitrarily small. Then, for all $i$; $1 \le i \le k$,

$$\varphi_i \ge (1-\xi)/n. \tag{E.8}$$



Proof: Let $\hat\theta_k = n_x(k)/n$ be the maximal component of $\hat\theta$, where $n_x(k)$ is the occurrence count of the respective letter. Then, first, we must have $\psi_k \ge (1-\xi/2)\hat\theta_k$. Otherwise, $\psi(\hat\theta) \in A$, and cannot be the pattern ML estimate, using Lemma 7.1. This is shown below. Assume $\psi_k < (1-\xi/2)\hat\theta_k$. Then,



CO, _ t\ Ou 

n^ k )> — >^ I y I 



> 



C r-3s/i 



i- £ /4 2 



> 



l-e/4' 



(E.9) 



The second inequality is since $\hat\theta_k \ge 1/k \ge 1/\sqrt{n^{1-\varepsilon}}$. The next inequality is again by the assumption that $k \le \sqrt{n^{1-\varepsilon}}$. The right hand side above shows that if $\psi_k < (1-\xi/2)\hat\theta_k < 2\hat\theta_k$, the condition of Lemma 7.1 is satisfied w.r.t. $\psi(\hat\theta)$, thus contradicting the fact that $\psi(\hat\theta)$ is the pattern ML probability vector.

Using the fact that $\psi_k \ge (1-\xi/2)\hat\theta_k$, we now show by differentiation that the pattern probability of $X^n$ attains its maximum w.r.t. $\psi_i$ for $\psi_i \ge (1-\xi/2)/n$. First, from (73),



dfa 



(E.10) 



where $n_x(\sigma_i)$ is the permuted entry of the occurrence vector at index $i$. This derivative is a weighted sum of decreasing functions in $\psi_i$, each attaining the value 0 at

$$\psi_i = \frac{n_x(\sigma_i)}{n_x(\sigma_k)}\,\psi_k \ge (1-\xi/2)\cdot\frac{n_x(k)}{n}\cdot\frac{n_x(\sigma_i)}{n_x(\sigma_k)} \ge \frac{1-\xi/2}{n}, \tag{E.11}$$

where the first inequality is since $\psi_k \ge (1-\xi/2)\hat\theta_k$, and the second is because $n_x(\sigma_i) \ge 1$ and $n_x(k) \ge n_x(\sigma_k)$. Finally, by the quantization of $\psi_i$ to $\varphi_i$ (using the definition of $\tau$) and the ordering in vector $\varphi$, we obtain $\varphi_i \ge \varphi_1 \ge (1-\xi)/n$. □

Using Lemma E.1 (in particular, (E.8)), we can now bound $\sqrt{\hat\theta_i}$ for the second region of the bound in (E.5). From (E.7) and bounding, we have



'Oi < 10 



3 y/n 



■l-e/4 



< 10 • 



(i - Oj 



n 



We can now bound the logarithm of the desired ratio. This is done below: 
(E.12)
Equalities (E.15) and (E.17) are obtained by using $\psi(\sigma_i) = \varphi(\sigma_i) + \delta[\psi(\sigma_i), \varphi(\sigma_i)]$ and $\hat\theta_i = \varphi(\sigma_i) + \delta[\hat\theta_i, \varphi(\sigma_i)]$, respectively. Inequality (E.16) is true because $\ln(1+x) \le x$ for $x > -1$. The sum on all displacements of one distribution w.r.t. the other is zero, yielding (E.18). Then, the bounds in (E.1), (E.2), and (E.5) result in (E.19), where we use the worst case defined following (E.3). Then, we rearrange the sum by $j$ instead of $i$ and use (E.12) to obtain (E.20). The last inequality of (E.20) is obtained since $\sum_{j=1}^{k} 1/j \le \ln(e(k+1))$. This concludes the proof of Lemma 7.2. □



Appendix F — Proof of Theorem 4



The individual modified redundancy of the code defined in (82)-(83) is obtained by

$$nR_n\left[Q_k, \Psi(x^n)\right] = -\log Q_k\left[\Psi(x^n)\right] + \log P_{ML}(x^n). \tag{F.1}$$

From (82)-(83), it can be observed that

$$Q_k\left[\Psi(x^n)\right] = k!\cdot Q_{KT}(x^n). \tag{F.2}$$



Therefore, the individual modified redundancy of this code is $\log(k!)$ bits less than the i.i.d. redundancy of the KT code, which is well known. This yields the bound of (84). However, for the sake of completeness, we show the main steps of the derivation of the bound from $Q_k[\Psi(x^n)]$ itself.

By definition of $Q_k[\Psi(x^n)]$ in (82)-(83),

$$-\log Q_k\left[\Psi(x^n)\right] = \begin{cases} -\log\left\{\dfrac{\left(\frac{k}{2}-1\right)!\cdot k!}{\left(n+\frac{k}{2}-1\right)!}\displaystyle\prod_{j=1}^{k}\dfrac{[2n_x(j)]!}{2^{2n_x(j)}[n_x(j)]!}\right\}, & \text{for even } k, \\[3ex] -\log\left\{\dfrac{(k-1)!\cdot\left(n+\frac{k-1}{2}\right)!\cdot 2^{2n+k-1}\cdot k!}{\left(\frac{k-1}{2}\right)!\cdot(2n+k-1)!\cdot 2^{k-1}}\displaystyle\prod_{j=1}^{k}\dfrac{[2n_x(j)]!}{2^{2n_x(j)}[n_x(j)]!}\right\}, & \text{for odd } k, \end{cases} \tag{F.3}$$

where $n_x(j)$ is the number of occurrences of index $j$ in the pattern $\Psi(x^n)$, and the $k!$ factor is the only factor in the expression above beyond those of the standard KT probability. The terms to the left of the product on the right hand side of the equation (except the $k!$ term) are the result of multiplying the values of the denominator at all time points from 1 to $n$. The product on the right hand side with the $k!$ term is the result of multiplying the numerators. To complete the derivation of the bound in (84), we plug (F.3) into (F.1) to compute the redundancy, use Stirling's approximation (38) to upper and lower bound factorials, use the relationship $\ln(1+x) \le x$, and combine similar order terms. The ML i.i.d. probability is reduced by the occurrence of the same factors in $Q_k[\Psi(x^n)]$ resulting from the product term on the right hand side of (F.3). This concludes the proof of Theorem 4. □
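The relation (F.2) can be illustrated by brute force for a short sequence. The Python sketch below assumes the standard sequential KT (add-1/2) estimator for $Q_{KT}$ and a pattern that uses all $k$ indices; it sums the KT probabilities of every $k$-ary sequence that shares a given pattern and compares the sum with $k!\cdot Q_{KT}$. The function names are hypothetical, and the code is not claimed to reproduce the exact definition in (82)-(83), only the relation stated in (F.2).

```python
from fractions import Fraction
from itertools import product
from math import factorial


def pattern(seq):
    """Replace each symbol by the rank (from 1) of its first occurrence."""
    first_seen = {}
    out = []
    for symbol in seq:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        out.append(first_seen[symbol])
    return tuple(out)


def kt_probability(seq, k):
    """Sequential KT (add-1/2) probability of seq over a k-letter alphabet."""
    counts = {}
    prob = Fraction(1)
    for t, symbol in enumerate(seq):
        seen = counts.get(symbol, 0)
        prob *= Fraction(2 * seen + 1, 2 * t + k)  # (seen + 1/2) / (t + k/2)
        counts[symbol] = seen + 1
    return prob


def pattern_probability_bruteforce(pat, k):
    """Sum the KT probabilities of all k-ary sequences whose pattern is pat."""
    total = Fraction(0)
    for seq in product(range(k), repeat=len(pat)):
        if pattern(seq) == tuple(pat):
            total += kt_probability(seq, k)
    return total


# Illustrative pattern using all k = 3 indices.
example_pattern = (1, 2, 1, 3, 2)
single = kt_probability(example_pattern, 3)
summed = pattern_probability_bruteforce(example_pattern, 3)
```

Because the KT probability depends only on the occurrence counts, all $k!$ relabelings of a sequence with $k$ distinct letters receive the same probability, which is why the brute-force sum equals $k!$ times a single term. If a pattern uses fewer than $k$ indices, the number of matching sequences is $k!/(k-m)!$ for $m$ distinct indices, so the factor differs.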
Appendix G — Proof of Theorem 5



To prove Theorem 5, we need to make one key observation. Let 



m 



k (k + l 



2 + — — ■ <Ga > 
Then, we can use (G.1) to upper bound the product in the denominator of $Q[\Psi(x^n)]$ by $(n+m-1)!/(m-1)!$. This bound replaces each term of the product over the time $n$ by an expression that is larger than that term, resulting in a somewhat loose bound. This is true even for the worst sequence in which all the $k$ distinct letters occur in the first $k$ time units, for which the denominator of $Q[\Psi(x^n)]$ is maximal. Using this bound,

Q f¥ (x")l > (m - 1)! - (fc!)1 ~ £ • A (G 2) 

The remaining steps use Stirling's bounds (38) to bound factorial terms, and the bound $\ln(1+x) \le x$, and then combine similar order terms, eventually substituting (G.1) to express the bound as a function of $k$. Finally, we plug an upper bound on the negative logarithm of $Q[\Psi(x^n)]$ in (F.1) instead of $Q_k[\Psi(x^n)]$. The components of the i.i.d. ML probability cancel out, and (92) is attained. This concludes the proof of Theorem 5. □